I am trying to create a function that takes in CSV files, creates a dataframe from each, and concatenates/sums them like so:
id number_of_visits
0 3902932804358904910 2
1 5972629290368575970 1
2 5345473950081783242 1
3 4289865755939302179 1
4 36619425050724793929 19
+
id number_of_visits
0 3902932804358904910 5
1 5972629290368575970 10
2 5345473950081783242 3
3 4289865755939302179 20
4 36619425050724793929 13
=
id number_of_visits
0 3902932804358904910 7
1 5972629290368575970 11
2 5345473950081783242 4
3 4289865755939302179 21
4 36619425050724793929 32
My main issue: in the for loop, after creating the dataframes, I tried to combine them with df += new_df, but new_df was never added. So I tried the following implementation.
def add_dfs(files):
    master = []
    big = pd.DataFrame({'id': 0, 'number_of_visits': 0}, index=[0])  # dummy df to initialize
    for k in range(len(files)):
        new_df = create_df(str(files[k]))  # helper method to read, create and clean dfs
        master.append(new_df)  # creates a list of dataframes in master
    for k in range(len(master)):
        # iterate through the list of dfs and add them together
        big = pd.concat([big, master[k]]).groupby(['id', 'number_of_visits']).sum().reset_index()
    return big
Which gives me the following
id number_of_visits
1 1000036822946495682 2
2 1000036822946495682 4
3 1000044447054156512 1
4 1000044447054156512 9
5 1000131582129684623 1
So the number_of_visits values for each id aren't actually being added together; the rows are just being sorted by number_of_visits.
Your loop groups on ['id', 'number_of_visits'], which makes number_of_visits part of the group key, so there is nothing left to sum. Instead, pass your list of dataframes directly to concat(), then group on id alone and sum:
>>> pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
id number_of_visits
0 36619425050724793929 32
1 3902932804358904910 7
2 4289865755939302179 21
3 5345473950081783242 4
4 5972629290368575970 11
def add_dfs(files):
    master = []
    for f in files:
        new_df = create_df(f)
        master.append(new_df)
    big = pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
    return big
You can use
df1['number_of_visits'] += df2['number_of_visits']
Note that this adds by row position, so it assumes both frames list the same ids in the same order. Here it gives you:
| | id | number_of_visits |
|---:|---------------------:|-------------------:|
| 0 | 3902932804358904910 | 7 |
| 1 | 5972629290368575970 | 11 |
| 2 | 5345473950081783242 | 4 |
| 3 | 4289865755939302179 | 21 |
| 4 | 36619425050724793929 | 32 |
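If the row orders or id sets can differ between files, aligning on id is safer than adding by position. A minimal sketch, assuming two frames df1 and df2 as above:
# align on 'id'; fill_value=0 keeps ids present in only one frame
out = (
    df1.set_index('id')
       .add(df2.set_index('id'), fill_value=0)
       .reset_index()
)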
I have this dataframe with columns like
| LHA_1 | JH_1 | LHA_2 | JH_2 | LHA_3 | JH_3 | LHA_4 | JH_5 | ....
What I would like to do is compute LHA_2 - JH_1, LHA_3 - JH_2, LHA_4 - JH_3, and so on,
and the final result would look like
| LHA_1 | JH_1 | LHA_2 |LHA_2 - JH_1| JH_2 | LHA_3 |LHA_3 - JH_2| JH_3 | LHA_4 | JH_5 | ....
You can use pd.IndexSlice. Suppose the following dataframe:
>>> df
LHA_1 JH_1 LHA_2 JH_2 LHA_3 JH_3 LHA_4 JH_5 LHA_6
0 9 8 7 6 5 4 3 2 1
1 19 18 17 16 15 14 13 12 11
# take every second column starting from 'LHA_2' and from 'JH_1',
# then subtract the two blocks element-wise
res = df.loc[:, pd.IndexSlice['LHA_2'::2]].values \
    - df.loc[:, pd.IndexSlice['JH_1'::2]].values
res = pd.DataFrame(res).add_prefix('LHA_JH_')
out = pd.concat([df, res], axis=1)
print(out)
# Output
LHA_1 JH_1 LHA_2 JH_2 LHA_3 JH_3 LHA_4 JH_5 LHA_6 LHA_JH_0 LHA_JH_1 LHA_JH_2 LHA_JH_3
0 9 8 7 6 5 4 3 2 1 -1 -1 -1 -1
1 19 18 17 16 15 14 13 12 11 -1 -1 -1 -1
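If you also want the question's column names (LHA_2 - JH_1, and so on) instead of the LHA_JH_ prefix, a small variant of the same slicing (a sketch, assuming the same df):
lha = df.loc[:, pd.IndexSlice['LHA_2'::2]]
jh = df.loc[:, pd.IndexSlice['JH_1'::2]]
# name each difference column after the pair it came from
res = pd.DataFrame(
    lha.values - jh.values,
    columns=[f'{l} - {j}' for l, j in zip(lha.columns, jh.columns)],
)
out = pd.concat([df, res], axis=1)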
I have a df which looks like this:
        a  b
apple   7  2
google  8  8
swatch  6  6
merc    7  8
other   8  9
I want to select a given row by name, say "apple", and move it to a new location, say -1 (the second-to-last row).
desired output
        a  b
google  8  8
swatch  6  6
merc    7  8
apple   7  2
other   8  9
Are there any functions available to achieve this?
Use Index.difference to remove the value and numpy.insert to add it at the new position, then use DataFrame.reindex or DataFrame.loc to change the order of rows:
a = 'apple'
idx = np.insert(df.index.difference([a], sort=False), -1, a)
print(idx)
Index(['google', 'swatch', 'merc', 'apple', 'other'], dtype='object')
df = df.reindex(idx)
# alternative:
# df = df.loc[idx]
print(df)
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
This seems clean; I am using pd.Index.insert() and pd.Index.drop_duplicates() to insert 'apple' at the target position first, then drop its original position by keeping the last duplicate:
df.reindex(df.index.insert(-1,'apple').drop_duplicates(keep='last'))
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
I'm not aware of any built-in function, but one approach would be to manipulate the index only, then use the new index to re-order the DataFrame (assumes all index values are unique):
name = 'apple'
position = -1
new_index = [i for i in df.index if i != name]
new_index.insert(position, name)
df = df.loc[new_index]
Results:
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
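For reuse, the index manipulation can be wrapped in a small helper (a sketch; move_row is a hypothetical name, and it assumes unique index values):
def move_row(df, name, position):
    # rebuild the index without `name`, re-insert it at `position`,
    # then reorder the frame with .loc
    new_index = [i for i in df.index if i != name]
    new_index.insert(position, name)
    return df.loc[new_index]

df = move_row(df, 'apple', -1)  # list.insert(-1, ...) puts it second to last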
I have the following type of data, I am doing predictions:
Input: 1 | 2 | 3 | 4 | 5
1 | 2 | 3 | 4 | 5
1 | 2 | 3 | 4 | 5
Output: 6
7
8
I want to predict one at a time and feed it back to the input as the value for the last column. I use this function but it is not working well:
def moving_window(num_future_pred):
    preds_moving = []
    moving_test_window = [test_X[0, :].tolist()]
    moving_test_window = np.array(moving_test_window)
    for j in range(1, len(test_Y)):
        moving_test_window = [test_X[j, :].tolist()]
        moving_test_window = np.array(moving_test_window)
        pred_one_step = model.predict(moving_test_window[:, :, :])
        preds_moving.append(pred_one_step[0, 0])
        pred_one_step = pred_one_step.reshape((1, 1, 1))
        moving_test_window = np.concatenate(
            (moving_test_window[:, :4, :], pred_one_step), axis=1)
    return preds_moving
preds_moving = moving_window(len(test_Y))
What I want:
Input: 1 | 2 | 3 | 4 | 5
1 | 2 | 3 | 4 | 6
1 | 2 | 3 | 4 | 17
Output: 6
17
18
Basically, I want to make the first prediction [1,2,3,4,5] --> 6, then for each next input drop its last column and substitute the previous predicted value.
What my function does now is take all the inputs as they are and make a prediction for each row, so nothing is fed back. Any idea appreciated!
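One way to get that feedback loop, as a minimal sketch: it assumes test_X has shape (n, 5, 1) and model.predict returns shape (1, 1). The key change from the function above is that each new window is built from the previous prediction instead of being reset from test_X alone:
import numpy as np

def moving_window(num_future_pred):
    preds_moving = []
    window = test_X[0:1, :, :]  # seed window, shape (1, 5, 1)
    pred = model.predict(window)
    preds_moving.append(pred[0, 0])
    for j in range(1, num_future_pred):
        # next row's first 4 values + previous prediction as the 5th
        window = np.concatenate(
            (test_X[j:j + 1, :4, :], pred.reshape((1, 1, 1))), axis=1)
        pred = model.predict(window)
        preds_moving.append(pred[0, 0])
    return preds_moving

preds_moving = moving_window(len(test_Y))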
I cannot figure out how to get rid of every row that has a specific value in a column while keeping the first occurrence.
I tried using drop_duplicates, but that gets rid of too much. I only want to de-duplicate rows with one particular value (within the same column).
Data is formatted like so:
Col_A | Col_B
5 | 1
5 | 2
1 | 3
5 | 4
1 | 5
5 | 6
I want it like (based on Col_A):
Col_A | Col_B
5 | 1
5 | 2
1 | 3
5 | 4
5 | 6
Use idxmax and check the index: idxmax returns the label of the first True, i.e. the first occurrence. This of course assumes your index is unique.
m = df.Col_A.eq(1)  # replace 1 with your desired bad value
df.loc[~m | (df.index == m.idxmax())]  # keep non-matches plus the first match
Col_A Col_B
0 5 1
1 5 2
2 1 3
3 5 4
5 5 6
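If you'd rather not rely on idxmax, the cumulative sum of the mask equals 1 exactly at the first match (a sketch of the same idea):
m = df.Col_A.eq(1)
df.loc[~m | m.cumsum().eq(1)]  # non-matching rows, plus the first match only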
Try this:
df1 = df.copy()
# offset the values we want to keep (the 5s) so they become unique
# and survive drop_duplicates
mask = df['Col_A'] == 5
df1.loc[mask, 'Col_A'] = df1.loc[mask, 'Col_A'] + range(len(df1.loc[mask, 'Col_A']))
# now only the repeated bad value (1) still has duplicates
df1 = df1.drop_duplicates(subset='Col_A', keep='first')
print(df.iloc[df1.index])
Output:
Col_A Col_B
0 5 1
1 5 2
2 1 3
3 5 4
5 5 6
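One caveat: the range offset could collide with a value already present in Col_A. A collision-free spelling of the same idea using duplicated() (a sketch):
bad = df['Col_A'].eq(1)              # the value to de-duplicate
df[~(bad & df.duplicated('Col_A'))]  # drop only the repeated bad rows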
Let's say I have this pandas dataframe:
id | userid | type
1 | 20 | a
2 | 20 | a
3 | 20 | b
4 | 21 | a
5 | 21 | b
6 | 21 | a
7 | 21 | b
8 | 21 | b
I want to obtain the number of times 'b follows a' for each user, and obtain a new dataframe like this:
userid | b_follows_a
20 | 1
21 | 2
I know I can do this using a for loop. However, I wonder if there is a more elegant solution.
You can use shift() to check whether each a is followed by a b, combine the checks with a vectorized &, and count the Trues with a sum:
df.groupby('userid').type.apply(lambda x: ((x == "a") & (x.shift(-1) == "b")).sum()).reset_index()
#userid type
#0 20 1
#1 21 2
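To also get the b_follows_a column name from the question, name the Series when resetting the index (a sketch):
out = (
    df.groupby('userid')['type']
      .apply(lambda x: ((x == 'a') & (x.shift(-1) == 'b')).sum())
      .reset_index(name='b_follows_a')
)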
Creative solution:
In [49]: df.groupby('userid')['type'].sum().str.count('ab').reset_index()
Out[49]:
userid type
0 20 1
1 21 2
Explanation:
In [50]: df.groupby('userid')['type'].sum()
Out[50]:
userid
20 aab
21 ababb
Name: type, dtype: object
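This relies on sum() concatenating the strings in each group; ''.join is a more explicit spelling of the same idea (a sketch):
(
    df.groupby('userid')['type']
      .agg(''.join)        # 'aab', 'ababb'
      .str.count('ab')     # non-overlapping occurrences of 'a' then 'b'
      .reset_index()
)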