import pandas as pd
year = '2018'
idNumber = '450007'
df = pd.read_csv('FARS_data/FARS' + year + 'NationalCSV/ACCIDENT.csv')
df = df.astype(str)
print (df.loc[df['ST_CASE'] == idNumber].squeeze(axis='index'))
df = pd.read_csv('FARS_data/FARS' + year + 'NationalCSV/PERSON.csv')
df = df.astype(str)
rows = df.loc[df['ST_CASE'] == idNumber]
for i in range(len(rows.index)):
    print(rows.iloc[i].to_frame().transpose().squeeze(axis='index'))
The type of all the items printed is <class 'pandas.core.series.Series'>.
The reason I use a for loop for the second CSV is that rows contains two rows, while the first query returns only one, so I have to make the DataFrame produce two separate Series.
My output is below: (comments added by me)
#First print
STATE 45
ST_CASE 450007
VE_TOTAL 2
VE_FORMS 2
PVH_INVL 0
PEDS 0
PERNOTMVIT 0
PERMVIT 2
PERSONS 2
COUNTY 15
CITY 1072
DAY 3
MONTH 1
YEAR 2018
DAY_WEEK 4
HOUR 9
MINUTE 28
NHS 0
RUR_URB 2
FUNC_SYS 4
RD_OWNER 1
ROUTE 3
TWAY_ID SR-136
TWAY_ID2 nan
MILEPT 0
LATITUDE 32.15663889
LONGITUD -79.1526
SP_JUR 0
HARM_EV 12
MAN_COLL 6
RELJCT1 0
RELJCT2 1
TYP_INT 1
WRK_ZONE 0
REL_ROAD 1
LGT_COND 1
WEATHER1 3
WEATHER2 0
WEATHER 3
SCH_BUS 0
RAIL 0000000
NOT_HOUR 99
NOT_MIN 99
ARR_HOUR 99
ARR_MIN 99
HOSP_HR 99
HOSP_MN 99
CF1 0
CF2 0
CF3 0
FATALS 1
DRUNK_DR 0
Name: 25834, dtype: object
#Second print
STATE 45
ST_CASE 450007
VE_FORMS 2
VEH_NO 1
PER_NO 1
...
P_SF3 0
WORK_INJ 0
HISPANIC 7
RACE 1
LOCATION 0
Name: 64374, Length: 62, dtype: object
#Third print
STATE 45
ST_CASE 450007
VE_FORMS 2
VEH_NO 2
PER_NO 1
...
P_SF3 0
WORK_INJ 8
HISPANIC 0
RACE 0
LOCATION 0
Name: 64375, Length: 62, dtype: object
What I want to happen is for the second two prints to look like the first one - all pretty and formatted, showing all the columns.
The default pd.options.display.max_rows is 60, while the Series printed from PERSON.csv has 62 entries (one per column, as the Length: 62 in the output shows), so it was being truncated. Setting pd.options.display.max_rows = 999 before running any of the code allowed everything to show.
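For reference, this is the one-line change described above (a minimal sketch):
import pandas as pd

# raise the truncation threshold so Series longer than 60 entries print in full
pd.options.display.max_rows = 999
# equivalent: pd.set_option('display.max_rows', 999)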
So I want to spread the shipments for each ID across its group one unit at a time, using the average (BAL/SALES) to determine which store gets each unit.
Here's my dataframe:
ID STOREID BAL SALES SHIP
1 STR1 50 5 18
1 STR2 6 7 18
1 STR3 74 4 18
2 STR1 35 3 500
2 STR2 5 4 500
2 STR3 54 7 500
While SHIP (grouped by ID) is greater than 0, calculate AVG (BAL/SALES) and, for the row with the lowest AVG in each group, add +1 to its BAL column and +1 to its Final column. Then repeat the process until SHIP is 0. The AVG is different at every stage, which is why I wanted this to be a while loop.
Sample output of the first round is below. Keep doing this until SHIP is 0 and the sum of Final per ID equals SHIP:
ID STOREID BAL SALES SHIP AVG Final
1 STR1 50 5 18 10 0
1 STR2 6 4 18 1.5 1
1 STR3 8 4 18 2 0
2 STR1 35 3 500 11.67 0
2 STR2 5 4 500 1.25 1
2 STR3 54 7 500 7.71 0
I've tried a couple of ways in SQL, but I thought it would be better to do it in Python. I haven't been doing a great job with my loop. Here's what I've tried so far:
df['AVG'] = 0
df['FINAL'] = 0
for i in df.groupby(["ID"])['SHIP']:
    if i > 0:
        df['AVG'] = df['BAL'] / df['SALES']
        df['SHIP'] = df.groupby(["ID"])['SHIP'] - 1
        total = df.groupby(["ID"])["FINAL"].transform("cumsum")
        df['FINAL'] = + 1
        df['A'] = + 1
    else:
        df['FINAL'] = 0
This was challenging because more than one row in the group can have the same average calculation, which then throws off the allocation.
This works on the example dataframe, if I understood you correctly.
d = {'ID': [1, 1, 1, 2,2,2], 'STOREID': ['str1', 'str2', 'str3','str1', 'str2', 'str3'],'BAL':[50, 6, 74, 35,5,54], 'SALES': [5, 7, 4, 3,4,7], 'SHIP': [18, 18, 18, 500,500,500]}
df = pd.DataFrame(data=d)
df['AVG'] = 0
df['FINAL'] = 0
def calc_something(x):
    # print(x.iloc[0]['SHIP'])
    # hand out the group's SHIP units one at a time (capped at 500 iterations)
    for i in range(x.iloc[0]['SHIP'])[0:500]:
        x['AVG'] = x['BAL'] / x['SALES']
        x['SHIP'] = x['SHIP'] - 1
        # put the row with the lowest AVG first
        x = x.sort_values('AVG').reset_index(drop=True)
        # print(x.iloc[0, 2])
        # give that row +1 BAL (column 2) and +1 FINAL (column 6)
        x.iloc[0, 2] = x['BAL'][0] + 1
        x.iloc[0, 6] = x['FINAL'][0] + 1
    return x
df_final = df.groupby('ID').apply(calc_something).reset_index(drop=True).sort_values(['ID', 'STOREID'])
df_final
ID STOREID BAL SALES SHIP AVG FINAL
1 1 STR1 50 5 0 10.000 0
0 1 STR2 24 7 0 3.286 18
2 1 STR3 74 4 0 18.500 0
4 2 STR1 127 3 0 42.333 92
5 2 STR2 170 4 0 42.500 165
3 2 STR3 297 7 0 42.286 243
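As a quick sanity check (a small sketch, assuming the df and df_final built above), you can confirm that the allocated units add up to the requested shipments per ID:
# total FINAL per ID should match the original SHIP values (18 and 500 here)
print(df_final.groupby('ID')['FINAL'].sum())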
I have a large dataset and I want to sample from it, but with a condition. What I need is a new dataframe with roughly the same count of each value of a boolean target column of 0s and 1s.
What I have:
df['target'].value_counts()
0 = 4000
1 = 120000
What I need:
new_df['target'].value_counts()
0 = 4000
1 = 6000
I know I can use df.sample, but I don't know how to apply the condition.
Thanks
Since pandas 1.1.0, you can use groupby.sample if you need the same number of rows for each group:
df.groupby('target').sample(4000)
Demo:
df = pd.DataFrame({'x': [0] * 10 + [1] * 25})
df.groupby('x').sample(5)
x
8 0
6 0
7 0
2 0
9 0
18 1
33 1
24 1
32 1
15 1
If you need to sample conditionally based on the group value, you can do:
df.groupby('target', group_keys=False).apply(
    lambda g: g.sample(4000 if g.name == 0 else 6000)
)
Demo:
df.groupby('x', group_keys=False).apply(
    lambda g: g.sample(4 if g.name == 0 else 6)
)
x
7 0
8 0
2 0
1 0
18 1
12 1
17 1
22 1
30 1
28 1
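A small aside (a sketch assuming the same df as in the demo above): if the draw needs to be reproducible, sample accepts a random_state.
# fixed seed, so the same rows are drawn on every run
df.groupby('x', group_keys=False).apply(
    lambda g: g.sample(4 if g.name == 0 else 6, random_state=42)
)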
Assuming the following input and using the values 4/6 instead of 4000/6000:
df = pd.DataFrame({'target': [0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1]})
You could groupby your target and sample to take at most N values per group:
df.groupby('target', group_keys=False).apply(lambda g: g.sample(min(len(g), 6)))
example output:
target
4 0
0 0
8 0
12 0
10 1
14 1
1 1
7 1
11 1
13 1
If you want the same size for every group, you can simply use df.groupby('target').sample(n=4).
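If you would rather not hard-code n (a minimal sketch, assuming the target column from the question), the size of the smallest class can serve as the cap:
# sample every class down to the size of the rarest one
n_min = df['target'].value_counts().min()
balanced = df.groupby('target').sample(n=n_min)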
I have a dataframe that looks like this:
>> df
index week day hour count
5 10 2 10 70
5 10 3 11 80
7 10 2 18 15
7 10 2 19 12
where week is the week of the year, day is day of the week (0-6), and hour is hour of the day (0-23). However, since I plan to convert this to a 3D array (week x day x hour) later, I have to include hours where there are no items in the count column. Example:
>> target_df
index week day hour count
5 10 0 0 0
5 10 0 1 0
...
5 10 2 10 70
5 10 2 11 0
...
7 10 0 0 0
...
...
and so on. What I do is to generate a dummy dataframe containing all index-week-day-hour combinations possible (basically target_df without the count column):
>> dummy_df
index week day hour
5 10 0 0
5 10 0 1
...
5 10 2 10
5 10 2 11
...
7 10 0 0
...
...
and then using
target_df = pd.merge(df, dummy_df, on=['index','week','day','hour'], how='outer').fillna(0)
This works fine for small datasets, but I'm working with a lot of rows. With the case I'm working on now, I get 82M rows for dummy_df and target_df, and it's painfully slow.
EDIT: The slowest part is actually constructing dummy_df!!! I can generate the individual lists but combining them into a pandas dataframe is the slowest part.
import itertools

num_weeks = len(week_list)
num_idxs = len(df['index'].unique())
print('creating dummies')
_dummy_idxs = list(itertools.chain.from_iterable(
    itertools.repeat(x, 24*7*num_weeks) for x in df['index'].unique()))
print('\t_dummy_idxs')
_dummy_weeks = list(itertools.chain.from_iterable(
    itertools.repeat(x, 24*7) for x in week_list)) * num_idxs
print('\t_dummy_weeks')
_dummy_days = list(itertools.chain.from_iterable(
    itertools.repeat(x, 24) for x in range(0, 7))) * num_weeks * num_idxs
print('\t_dummy_days')
_dummy_hours = list(range(0, 24)) * 7 * num_weeks * num_idxs
print('\t_dummy_hours')
print('Creating dummy_hour_df with {0} rows...'.format(len(_dummy_hours)))
# the part below takes the longest time
dummy_hour_df = pd.DataFrame({'index': _dummy_idxs, 'week': _dummy_weeks,
                              'day': _dummy_days, 'hour': _dummy_hours})
print('dummy_hour_df completed')
Is there a faster way to do this?
As an alternative, you can use itertools.product for the creation of dummy_df as a product of lists:
import itertools
index = range(100)
weeks = range(53)
days = range(7)
hours = range(24)
dummy_df = pd.DataFrame(list(itertools.product(index, weeks, days, hours)), columns=['index','week','day','hour'])
dummy_df.head()
index week day hour
0 0 0 0 0
1 0 0 0 1
2 0 0 0 2
3 0 0 0 3
4 0 0 0 4
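A related sketch (an alternative I am adding here, not part of the original answer): if each index-week-day-hour combination appears at most once in df, you can skip building dummy_df and the merge entirely by reindexing against a MultiIndex grid.
# full index-week-day-hour grid built directly as a MultiIndex
full_idx = pd.MultiIndex.from_product(
    [index, weeks, days, hours], names=['index', 'week', 'day', 'hour'])
# align the original rows to the grid; missing counts become 0
target_df = (df.set_index(['index', 'week', 'day', 'hour'])
               .reindex(full_idx, fill_value=0)
               .reset_index())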
I have a pandas df which is more or less like this:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
.....
This DF contains a couple of million points. I am trying to generate some descriptors now to incorporate the time nature of the data. The idea is that for each line I should create a window of length x going back in the data and count the occurrences of that particular key in the window. I did an implementation, but according to my estimation, for 23 different windows the calculation will run for 32 days. Here is the code:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
There are multiple windows of different lengths. However, I have the uneasy feeling that iteration is probably not the smartest way to go for this data aggregation. Is there a way to implement it so it runs faster?
On a toy example data frame, you can achieve about a 7x speedup by using apply() instead of iterrows().
Here's some sample data, expanded a bit from OP to include multiple key values:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
7 8 94 1
8 9 94 1
9 10 38 1
import pandas as pd
df = pd.read_clipboard()
Based on these data, and the counting criteria defined by OP, we expect the output to be:
key dist window
ID
1 57 1 0
2 22 1 0
3 12 1 0
4 45 1 0
5 94 1 0
6 36 1 0
7 38 1 0
8 94 1 1
9 94 1 2
10 38 1 1
Using OP's approach:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
print('old solution: ')
%timeit features_wind2(df)
old solution:
10 loops, best of 3: 25.6 ms per loop
Using apply():
def compute_window(row):
    # when using apply(), .name gives the row index
    # pandas label indexing is inclusive, so take index-1 as cut_idx
    cut_idx = row.name - 1
    key = row.key
    # count the number of times key appears in df prior to this row
    return sum(df.loc[:cut_idx, 'key'] == key)
print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')
new solution:
100 loops, best of 3: 3.71 ms per loop
Note that with millions of records, this will still take a while, and the relative performance gains will likely be diminished somewhat compared to this small test case.
UPDATE
Here's an even faster solution, using groupby() and cumsum(). I made some sample data that seems roughly in line with the provided example, but with 10 million rows. The computation finishes in well under a second, on average:
# sample data
import numpy as np
import pandas as pd
N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1,100,size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx,'key':keys,'dist':dists})
df = df.set_index('ID')
Now performance testing:
%timeit df['window'] = df.groupby('key').cumsum().subtract(1)
1 loop, best of 3: 755 ms per loop
Here's enough output to show that the computation is working:
dist key window
ID
0 1 83 0
1 1 4 0
2 1 87 0
3 1 66 0
4 1 31 0
5 1 33 0
6 1 1 0
7 1 77 0
8 1 49 0
9 1 49 1
10 1 97 0
11 1 36 0
12 1 19 0
13 1 75 0
14 1 4 1
Note: To revert ID from index to column, use df.reset_index() at the end.
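One caveat on the cumsum approach: it counts every earlier occurrence of a key, not just those inside the last 200 rows as in the original query. If the fixed window length matters and the number of distinct keys is moderate, a vectorized sketch (illustrative only, not benchmarked at the 10M-row scale, and memory-hungry when there are many distinct keys) is to one-hot encode the keys and take a rolling sum over the preceding rows:
import numpy as np
import pandas as pd

# indicator matrix: one column per distinct key value
dummies = pd.get_dummies(df['key']).astype(int)
# per-key counts over the 199 rows strictly before the current one
# (matches index > lid - 200 and index < lid in the original query)
prior = dummies.shift(fill_value=0).rolling(window=199, min_periods=1).sum()
# for each row, pick the count that belongs to its own key
col_pos = dummies.columns.get_indexer(df['key'])
df['window1'] = prior.to_numpy()[np.arange(len(df)), col_pos]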
I have data that I have converted to a pandas DataFrame:
import pandas as pd
d = [
(1,70399,0.988375133622),
(1,33919,0.981573492596),
(1,62461,0.981426807114),
(579,1,0.983018778374),
(745,1,0.995580488899),
(834,1,0.980942505189)
]
df_beda = pd.DataFrame(d, columns=['source', 'target', 'weight'])
and I need to build a Series for mapping the source and target columns onto new values:
e = []
for x in d:
    e.append(x[0])
    e.append(x[1])
# unique values, sorted, so the rank-based coding below is well defined
e = sorted(set(e))
df_new = pd.DataFrame(e, columns=['source_target'])
df_new.source_target = (df_new.source_target.diff() != 0).cumsum() - 1
new_ser = pd.Series(df_new.source_target.values, index=e).drop_duplicates()
so I get this Series:
source_target
1 0
579 1
745 2
834 3
33919 4
62461 5
70399 6
dtype: int64
I have tried to change the DataFrame df_beda based on the new_ser Series using:
df_beda.target = df_beda.target.mask(df_beda.target.isin(new_ser), df_beda.target.map(new_ser)).astype(int)
df_beda.source = df_beda.source.mask(df_beda.source.isin(new_ser), df_beda.source.map(new_ser)).astype(int)
but the result is:
source target weight
0 0 70399 0.988375
1 0 33919 0.981573
2 0 62461 0.981427
3 579 0 0.983019
4 745 0 0.995580
5 834 0 0.980943
This is wrong; the ideal result is:
source target weight
0 0 6 0.988375
1 0 4 0.981573
2 0 5 0.981427
3 1 0 0.983019
4 2 0 0.995580
5 3 0 0.980943
Can anyone show me where my mistake is?
Thanks
If the order doesn't matter, you can do the following. Avoid a for loop unless it's absolutely necessary.
import numpy as np

uniq_vals = np.unique(df_beda[['source', 'target']])
map_dict = dict(zip(uniq_vals, range(len(uniq_vals))))
df_beda[['source', 'target']] = df_beda[['source', 'target']].replace(map_dict)
print(df_beda)
source target weight
0 0 6 0.988375
1 0 4 0.981573
2 0 5 0.981427
3 1 0 0.983019
4 2 0 0.995580
5 3 0 0.980943
If you want to roll back, you can create an inverse map from the original one, because it is guaranteed to be a 1-to-1 mapping.
inverse_map = {v: k for k, v in map_dict.items()}
df_beda[['source', 'target']] = df_beda[['source', 'target']].replace(inverse_map)
print(df_beda)
source target weight
0 1 70399 0.988375
1 1 33919 0.981573
2 1 62461 0.981427
3 579 1 0.983019
4 745 1 0.995580
5 834 1 0.980943
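A footnote on performance (a minimal sketch, assuming df_beda again holds the original values, as after the roll-back above): for a very large mapping, replace with a dict can be slow; since each code is just the position of a value in the sorted unique array, np.searchsorted gives the same encoding fully vectorized.
# each value's position in the sorted unique array is its dense code
uniq_vals = np.unique(df_beda[['source', 'target']])
df_beda[['source', 'target']] = np.searchsorted(uniq_vals, df_beda[['source', 'target']])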