I have a pandas dataframe containing 100 rows. It looks something like this:
id_number Name Age Salary
00001 Alice 50 6.2234
00002 John 29 9.1
.
.
.
00098 Susan 36 11.58
00099 Remy 50 3.7
00100 Walter 50 5.52
From this dataframe, I want to extract the rows corresponding to individuals whose ID numbers do NOT lie between 11 and 20. I want rows 0 to 9, and 20 to 99.
df.iloc allows extracting a contiguous set of rows, like 20 to 99, but not 0 to 9 and 20 to 99 in one go.
I also tried df[(df['id_number'] >= 20) & (df['id_number'] < 10)] but that returns an empty dataframe.
Is there a straightforward way to do this, that does not require doing two separate extractions and their concatenation?
Dropping the sliced index is what you need.
In this case we slice labels 11 through 20.
Data

import numpy as np
import pandas as pd

df = pd.Series(np.arange(1, 101))
df

Drop the slice

df.drop(df.loc[11:20].index, inplace=True)
df
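Applied to the question's dataframe, the same drop idea might look like this (a sketch assuming the default RangeIndex, so positions 10 through 19 hold id numbers 11 through 20):

# Rows at positions 10..19 hold id numbers 11..20 (assumed from the sample)
df = df.drop(df.index[10:20])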
This seemed to work (suggested by @FarhoodET):
df[(df['id_number'] >= 20) | (df['id_number'] < 10)]
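Another option, in case it helps: negate Series.between, which expresses "not between 11 and 20" directly. A sketch assuming id_number is numeric; if it is stored as a zero-padded string, convert it first:

# Keep every row whose id_number is NOT between 11 and 20 (inclusive)
ids = pd.to_numeric(df['id_number'])
result = df[~ids.between(11, 20)]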
Related
I want to make a different dataframe for those Number (Column B) where Main Date > Reported Date (see the image below). If this condition is true, then I have to make another dataframe displaying that Number's data.
Example
If we take Number (column B) 223311, and any Main Date > Reported Date exists for it, then display all the records of that Number.
Here is a simple solution with Pandas. You can separate out DataFrames very easily by the values of a particular column. From there, iterate over each new DataFrame, resetting its index (if you want to keep the original index, skip the reset). I appended them to a list for convenience; they could easily be extracted into labeled dataframes or combined. The long variable names are there to aid comprehension.
import pandas as pd

df = pd.read_csv('forstack.csv')
list_of_dataframes = []  # A place to store each dataframe. You could also name them as you go
checked_Numbers = []     # Simply to avoid processing the same Number twice
for aNumber in df['Number']:  # For every number in the column "Number"
    if aNumber not in checked_Numbers:  # While this number has not been processed
        checked_Numbers.append(aNumber)  # Mark as checked
        # "Make a different Dataframe" per request, with a new index
        df_forThisNumber = df[df.Number == aNumber].reset_index(drop=True)
        # Check each row of this dataframe to see if it matches the criteria
        for index in range(len(df_forThisNumber)):
            if df_forThisNumber.at[index, 'Main Date'] > df_forThisNumber.at[index, 'Reported Date']:
                list_of_dataframes.append(df_forThisNumber)  # If it matches, append it
                break  # Avoid appending the same dataframe once per matching row
Outputs :
Main Date Number Reported Date Fee Amount Cost Name
0 1/1/2019 223311 1/1/2019 100 12 20 11
1 1/7/2019 223311 1/1/2019 100 12 20 11
Main Date Number Reported Date Fee Amount Cost Name
0 1/2/2019 111111 1/2/2019 100 12 20 11
1 1/6/2019 111111 1/2/2019 100 12 20 11
Main Date Number Reported Date Fee Amount Cost Name
0 1/3/2019 222222 1/3/2019 100 12 20 11
1 1/8/2019 222222 1/3/2019 100 12 20 11
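If the date columns are parsed as real datetimes, a more vectorized sketch of the same idea (file and column names taken from the question, the rest assumed) could be:

import pandas as pd

df = pd.read_csv('forstack.csv', parse_dates=['Main Date', 'Reported Date'])

# Keep every Number that has at least one row where Main Date > Reported Date
qualifying = df.groupby('Number').filter(
    lambda g: (g['Main Date'] > g['Reported Date']).any()
)

# One dataframe per qualifying Number, each with a fresh index
list_of_dataframes = [g.reset_index(drop=True) for _, g in qualifying.groupby('Number')]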
I have three dataframes, each with more than 71K rows. Below are samples.
df_1 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001],'Col_A':[45,56,78,33]})
df_2 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887],'Col_B':[35,46,78,33,66]})
df_3 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887,1223],'Col_C':[5,14,8,13,16,8]})
Edit
As suggested, below is my desired output
df_final
Device_ID Col_A Col_B Col_C
1001 45 35 5
1034 56 46 14
1223 78 78 8
1001 33 33 13
1887 NaN 66 16
1223 NaN NaN 8
While using pd.merge() or df_1.set_index('Device_ID').join([df_2.set_index('Device_ID'), df_3.set_index('Device_ID')], on='Device_ID'), it takes a very long time. One reason is the repeated values of Device_ID.
I am aware of the reduce approach, but I suspect it may lead to the same situation.
Is there a better and more efficient way?
To get your desired outcome, you can use this:
result = pd.concat([df_1.drop('Device_ID', axis=1),df_2.drop('Device_ID',axis=1),df_3],axis=1).set_index('Device_ID')
If you don't want to use Device_ID as the index, you can remove the set_index part of the code. Also, note that because of the presence of NaNs in some columns (Col_A and Col_B) of the final dataframe, Pandas will cast the non-missing values to floats, since NaN can't be stored in an integer array (unless you are on pandas 0.24+, which introduced a nullable integer dtype).
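One caveat worth stating: this concatenates by row position (the shared default RangeIndex), not by matching Device_ID values, so it assumes the three frames list the devices in the same row order. A quick check with the sample data:

import pandas as pd

df_1 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001], 'Col_A': [45, 56, 78, 33]})
df_2 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001, 1887], 'Col_B': [35, 46, 78, 33, 66]})
df_3 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001, 1887, 1223], 'Col_C': [5, 14, 8, 13, 16, 8]})

result = pd.concat(
    [df_1.drop('Device_ID', axis=1), df_2.drop('Device_ID', axis=1), df_3],
    axis=1,
).set_index('Device_ID')
print(result)  # matches the desired df_final, with NaN where a device is missing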
I'm facing a bit of an issue adding a new column to my Pandas DataFrame: I have a DataFrame in which each row represents a record of location data and a timestamp. Those records belong to trips, so each row also contains a trip id. Imagine the DataFrame looks something like this:
TripID Lat Lon time
0 42 53.55 9.99 74
1 42 53.58 9.99 78
3 42 53.60 9.98 79
6 12 52.01 10.04 64
7 12 52.34 10.05 69
Now I would like to delete the records of all trips that have less than a minimum amount of records to them. I figured I could simply get the number of records of each trip like so:
lengths = df['TripID'].value_counts()
Then my idea was to add an additional column to the DataFrame and fill it with the values from that Series corresponding to the trip id of each record. I would then be able to get rid of all rows in which the value of the length column is too small.
However, I can't seem to find a way to get the length values into the correct rows. Would any one have an idea for that or even a better approach to the entire problem?
Thanks very much!
EDIT:
My desired output should look something like this:
TripID Lat Lon time length
0 42 53.55 9.99 74 3
1 42 53.58 9.99 78 3
3 42 53.60 9.98 79 3
6 12 52.01 10.04 64 2
7 12 52.34 10.05 69 2
If I understand correctly, to get the length of the trip, you'd want to get the difference between the maximum time and the minimum time for each trip. You can do that with a groupby statement.
# Groupby, get the minimum and maximum times, then reset the index
df_new = df.groupby('TripID').time.agg(['min', 'max']).reset_index()
df_new['length_of_trip'] = df_new['max'] - df_new['min']
df_new = df_new.loc[df_new.length_of_trip > 90] # to pick a random number
That'll get you all the rows with a trip length above the amount you need, including the trip IDs.
You can use groupby and transform to directly add the lengths column to the DataFrame, like so:
df["lengths"] = df[["TripID", "time"]].groupby("TripID").transform("count")
I managed to find an answer to my question that is quite a bit nicer than my original approach as well:
df = df.groupby('TripID').filter(lambda x: len(x) > 2)
This can be found in the Pandas documentation. It gets rid of all groups that have 2 or fewer elements, i.e. trips that are 2 records or shorter in my case.
I hope this will help someone else out as well.
I have a dataframe with two "categories" of information. One category is repeated across multiple rows, and the other is specific to each row.
It looks something like this:
City State Industry Pay Hours
15 10 1 20 40
15 10 2 30 25
20 10 1 25 30
20 10 2 50 80
I want it to look like:
City State Industry1Pay Industry1Hours Industry2Pay Industry2Hours
15 10 20 40 30 25
20 10 25 30 50 80
This is a simplified version because the full table is much too long to fit up there. There are 8 columns in place of city and state, and 2 additional columns to pay and hours. In addition, each row should contain 4 industries for now (it will be 5 once that data comes in).
I am really struggling with how to do this. The dataset is from a project conducted in Stata, so the columns are mostly floats and need to stay that way for when I send it in.
The closest I think I've gotten is
wage = wage.pivot_table(index='cityid', columns='Industry').rename_axis(None)
wage.columns = wage.columns.map('_'.join)
but I get an error because you can't join a float to a string, and I suspect that this will not work the way I'm hoping it will regardless.
So far I've looked at quite a few stackoverflow questions, as well as:
https://hackernoon.com/reshaping-data-in-python-fa27dda2ff77
http://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/
and two others I am unable to link to because I haven't used stackoverflow very much
I'm really struggling with this, and would appreciate any help, even a link to a good tutorial to wrap my head around this. It seems like a really simple task but for the life of me I can't figure out how to do it without just manually moving stuff around in Excel.
I apologize in advance if this is a duplicate - I looked around a lot but I might be missing something obvious because I'm not sure what this is called beyond reshaping.
Let's use set_index and unstack:
df['Industry'] = 'Industry'+df.Industry.astype(str)
df_out = df.set_index(['City','State','Industry']).unstack()
And flatten the multiindex columns with swaplevel, map, join:
df_out.columns = df_out.columns.swaplevel(1,0)
df_out.columns = df_out.columns.map(''.join)
Output:
Industry1Pay Industry2Pay Industry1Hours Industry2Hours
City State
15 10 20 30 40 25
20 10 25 50 30 80
In case of multiple rows per City, State, and Industry combination, use pivot_table:
df['Industry'] = 'Industry'+df.Industry.astype(str)
df_out = df.pivot_table(index=['City','State'],columns='Industry',values=['Pay','Hours'], aggfunc='sum')
df_out.columns = df_out.columns.swaplevel(1,0)
df_out.columns = df_out.columns.map(''.join)
df_out
OR use groupby
df_out = df.groupby(['City','State','Industry'])[['Pay','Hours']].sum().unstack()
df_out.columns = df_out.columns.swaplevel(1,0)
df_out.columns = df_out.columns.map(''.join)
df_out
Output:
Industry1Pay Industry2Pay Industry1Hours Industry2Hours
City State
15 10 20 30 40 25
20 10 25 50 30 80
Here's how to use pivot_table:
In [38]: df.pivot_table(index=['City', 'State'],
                        columns='Industry',
                        values=['Pay', 'Hours'])
Out[38]:
Pay Hours
Industry 1 2 1 2
City State
15 10 20 30 40 25
20 10 25 50 30 80
To flatten the pivot and add column names.
In [94]: dff = df.pivot_table(index=['City', 'State'], columns='Industry',
                              values=['Pay', 'Hours'])
In [95]: cols = ['Industry%s%s' % x for x in zip(dff.columns.get_level_values(1),
                                                 dff.columns.get_level_values(0))]
In [96]: cols
Out[96]: ['Industry1Pay', 'Industry2Pay', 'Industry1Hours', 'Industry2Hours']
In [97]: dff.columns = cols
In [98]: dff.reset_index()
Out[98]:
City State Industry1Pay Industry2Pay Industry1Hours Industry2Hours
0 15 10 20 30 40 25
1 20 10 25 50 30 80
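The original error (joining floats to strings with '_'.join) can also be sidestepped by formatting the levels instead of joining them. A sketch using the wage / cityid names from the question:

pivoted = wage.pivot_table(index='cityid', columns='Industry')
# Format each (value, industry) level pair; int() assumes Industry holds numeric codes
pivoted.columns = ['Industry{}{}'.format(int(ind), val) for val, ind in pivoted.columns]
pivoted = pivoted.reset_index()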
I originally have 3 columns: timestamp, response_time, and type. I need to find the mean of response_time wherever the timestamps are the same, so I grouped identical timestamps together and applied the mean function to them. I got the following series, which is fine:
0 16.949689
1 17.274615
2 16.858884
3 17.025155
4 17.062008
5 16.846885
6 17.172994
7 17.025797
8 17.001974
9 16.924636
10 16.813300
11 17.152066
12 17.291899
13 16.946970
14 16.972884
15 16.871824
16 16.840024
17 17.227682
18 17.288211
19 17.370553
20 17.395759
21 17.449579
22 17.340357
23 17.137308
24 16.981012
25 16.946727
26 16.947073
27 16.830850
28 17.366538
29 17.054468
30 16.823983
31 17.115429
32 16.859003
33 16.919645
34 17.351895
35 16.930233
36 17.025194
37 16.824997
And I need to be able to plot column 1 vs column 2, but I am not able to extract them separately.
I obtained this column by doing groupby('timestamp') and then mean() on that.
The problem I need to solve is how to extract each column of this series. Or is there a better way to calculate the mean of one column for all identical entries of another column?
ORIGINAL DATA :
1445544152817,SEND_MSG,123
1445544152817,SEND_MSG,123
1445544152829,SEND_MSG,135
1445544152829,SEND_MSG,135
1445544152830,SEND_MSG,135
1445544152830,GET_QUEUE,12
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152830,SEND_MSG,136
1445544152831,SEND_MSG,138
1445544152831,SEND_MSG,136
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152831,SEND_MSG,137
1445544152832,SEND_MSG,138
1445544152832,SEND_MSG,138
1445544152833,SEND_MSG,138
1445544152833,SEND_MSG,139
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152834,SEND_MSG,140
1445544152835,SEND_MSG,140
1445544152835,SEND_MSG,141
1445544152849,SEND_MSG,155
1445544152849,SEND_MSG,155
1445544152850,GET_QUEUE,21
1445544152850,GET_QUEUE,21
For each timestamp I want to find the average of response_time and plot it. I did that successfully, as shown in the series above (the first data block), but I can no longer separate the timestamp and response_time columns.
A Series always has just one column of values. The first column you see is the index. You can get it with your_series.index. If you want the timestamp to become a data column again, rather than the index, you can use the as_index keyword in groupby:
df.groupby('timestamp', as_index = False).mean()
Or use your_series.reset_index().
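A minimal end-to-end sketch (file name and column names assumed, since the raw data has no header row):

import pandas as pd

df = pd.read_csv('data.csv', names=['timestamp', 'type', 'response_time'])

# Mean response_time per timestamp, keeping timestamp as a regular column
means = df.groupby('timestamp', as_index=False)['response_time'].mean()
means.plot(x='timestamp', y='response_time')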
If it's a Series, you can directly use:
your_series.mean()
You can extract a column with:
df['column_name']
and then apply mean() to the resulting Series:
df['column_name'].mean()