I have this data frame structure:
AU01_r AU02_r AU04_r AU05_r AU06_r AU07_r AU09_r AU10_r AU12_r
AU14_r AU15_r AU17_r AU20_r AU23_r AU25_r AU26_r AU45_r Segment
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
where every 7,500 records share the same segment. That is, the segments cover the following record ranges:
[[1, 7500], [7501, 15000], [15001, 22500], [22501, 30000], [30001, 37500], [37501, 45000], [45001, 52500], [52501, 60000], [60001, 67500], [67501, 75000], [75001, 82626]]
Currently I have code, using pandas, that calculates the mean for each segment:
df_AU_r_2_mean = df_AU_r_2.groupby(['Segment']).mean()
AU01_r AU02_r AU04_r AU05_r AU06_r AU07_r AU09_r AU10_r
AU12_r AU14_r AU15_r AU17_r AU20_r AU23_r AU25_r AU26_r AU45_r WD CF
Segment
1 0.192525 0.156520 0.888929 0.049577 0.092363 0.609992 0.039349 0.385985 0.242643 0.395441 0.456475 0.504961 0.253471 0.074785 0.509816 0.307315 0.093600 1 1
2 0.190215 0.155545 1.027495 0.144367 0.121984 0.872449 0.103985 0.582804 0.311179 0.685669 0.358625 0.605624 0.182963 0.187416 0.530021 0.521449 0.158552 1 0
3 0.187849 0.114435 1.028465 0.110275 0.045937 0.755899 0.088371 0.395693 0.128856 0.376444 0.491379 0.528315 0.245704 0.086708 0.483681 0.442268 0.173515 1 0
But I need to enhance it so that I can calculate the mean/sem/std of each AU column for every 1,500 records (i.e. divide each segment into smaller parts).
Can this be done using pandas DataFrame transformations?
First, add a new column as an incremental id. This will be used to create your new, smaller segments.
df.insert(0, 'id', range(1, 1 + len(df)))
After that, create a new column that marks each block of 1,500 rows.
df["new_Segment"] = pd.to_numeric(df.id//1500).shift(fill_value=0).add(1)
Now you can do the calculations based on the new segment column.
df_mean = df.groupby(['new_Segment']).mean()
At the end, the dataframe will be:
id A B C Segment new_Segment
1 x x x 1 1
2 x x x 1 1
..
1500 x x x 1 1
1501 x x x 1 2
..
7500 x x x 1 5
7501 x x x 2 6
..
For creating new columns with the calculations, use a grouped transform so the result aligns with the original rows:
df["A_mean"] = df.groupby('new_Segment')["A"].transform('mean')
I have a dataframe data from which I took a subset g2_data to perform some operations on. How would I go about replacing values in the original dataframe with values from the subset, using the values from one of the columns as the reference?
The column structure from data is retained in the subset g2_data shown below.
data:
idx group x1 y1
0 27 1 0.0 0.0
1 28 1 0.0 0.0
2 29 1 0.0 0.0
3 73 1 0.0 0.0
4 74 1 0.0 0.0
... ... ... ...
14612 14674 8 0.0 0.0
14613 14697 8 0.0 0.0
14614 14698 8 0.0 0.0
14615 14721 8 0.0 0.0
14616 14722 8 0.0 0.0
[14617 rows x 4 columns]
g2_data:
idx group x1 y1
1125 1227 2 115.0 0.0
1126 1228 2 0.0 220.0
1127 1260 2 0.0 0.0
1128 1294 2 0.0 0.0
1129 1295 2 0.0 0.0
... ... ... ...
3269 3277 2 0.0 0.0
3270 3308 2 0.0 0.0
3271 3309 2 0.0 0.0
3272 3342 2 0.0 0.0
3273 3343 2 0.0 0.0
[2149 rows x 4 columns]
Replace rows in Dataframe using index from another Dataframe has an answer that does this using the index values of the rows, but I would like to do it using the values from the idx column, in case I need to reset the index in the subset later on (i.e. starting from 0 instead of keeping the index values from the original dataframe). It is important to note that the values in the idx column are all unique, as they pertain to info about each observation.
This probably isn't optimal, but you can convert g2_data to a dictionary and then map the other columns based on idx, filtering the update to just those ids in the g2_data subset.
g2_data_dict = g2_data.set_index('idx').to_dict()
g2_data_ids = g2_data['idx'].to_list()
for k in g2_data_dict.keys():
    data.loc[data['idx'].isin(g2_data_ids), k] = data['idx'].map(g2_data_dict[k])
Use combine_first:
out = g2_data.set_index('idx').combine_first(data.set_index('idx')).reset_index()
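Since the idx values are unique, an in-place alternative (a sketch, not part of either answer) is DataFrame.update, which overwrites the matching rows of data with the non-NaN values from g2_data and leaves every other row untouched:
data_idx = data.set_index('idx')
data_idx.update(g2_data.set_index('idx'))   # align on idx, overwrite group/x1/y1 where they match
data = data_idx.reset_index()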
I have a set of csv files with Date and Time as the first two columns (no headers in the files). The files open up fine in Excel but when I try to read them into Python using Pandas read_csv, only the first Date is returned, whether or not I try a type conversion.
When I open a file in Notepad, it's not simply comma separated: there is a lot of whitespace before each line after line 1. I have tried skipinitialspace=True to no avail.
I have also tried various type conversions, but none work. I am currently using parse_dates=[['Date','Time']], infer_datetime_format=True, dayfirst=True.
Example output (no conversion):
0 1 2 3 4 ... 12 13 14 15 16
0 02/03/20 15:13:39 5.5 5.8 42.84 ... 30.0 79.0 0.0 0.0 0.0
1 NaN 15:13:49 5.5 5.8 42.84 ... 30.0 79.0 0.0 0.0 0.0
2 NaN 15:13:59 5.5 5.7 34.26 ... 30.0 79.0 0.0 0.0 0.0
3 NaN 15:14:09 5.5 5.7 34.26 ... 30.0 79.0 0.0 0.0 0.0
4 NaN 15:14:19 5.5 5.4 17.10 ... 30.0 79.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
39451 NaN 01:14:27 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39452 NaN 01:14:37 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39453 NaN 01:14:47 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39454 NaN 01:14:57 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39455 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
And with parse_dates etc:
Date_Time pH1 SP pH Ph1 PV pH ... 1 2 3
0 02/03/20 15:13:39 5.5 5.8 ... 0.0 0.0 0.0
1 nan 15:13:49 5.5 5.8 ... 0.0 0.0 0.0
2 nan 15:13:59 5.5 5.7 ... 0.0 0.0 0.0
3 nan 15:14:09 5.5 5.7 ... 0.0 0.0 0.0
4 nan 15:14:19 5.5 5.4 ... 0.0 0.0 0.0
... ... ... ... ... ... ... ...
39451 nan 01:14:27 5.5 8.4 ... 0.0 0.0 0.0
39452 nan 01:14:37 5.5 8.4 ... 0.0 0.0 0.0
39453 nan 01:14:47 5.5 8.4 ... 0.0 0.0 0.0
39454 nan 01:14:57 5.5 8.4 ... 0.0 0.0 0.0
39455 nan nan NaN NaN ... NaN NaN NaN
Data copied from Notepad (there is actually more whitespace in front of each line but it wouldn't work here):
Data from 67.csv
02/03/20,15:13:39,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0, 0.0
02/03/20,15:13:49,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0, 0.0
02/03/20,15:13:59,5.5,5.7,34.26,7.2,6.8,10.63,60.0,22.3,300,1,30,79,0.0,0.0, 0.0
02/03/20,15:14:09,5.5,5.7,34.26,7.2,6.8,10.63,60.0,15.3,300,45,30,79,0.0,0.0, 0.0
02/03/20,15:14:19,5.5,5.4,17.10,7.2,6.8,10.63,60.0,50.2,300,86,30,79,0.0,0.0, 0.0
And it opens fine in Excel, so I know the information is there and readable.
Code
import sys
import numpy as np
import pandas as pd
from datetime import datetime
from tkinter import filedialog
from tkinter import *

def import_file(filename):
    print('\nOpening ' + filename + ":")
    ## Read the data in the file
    df = pd.read_csv(filename, header=None, low_memory=False)
    print(df)
    df['Date_Time'] = pd.to_datetime(df[0] + ' ' + df[1])
    df.drop(columns=[0, 1], inplace=True)
    print(df)
    return df

filenames = []
print('Select files to read, Ctrl or Shift for Multiples')
TkWindow = Tk()
TkWindow.withdraw()  # we don't want a full GUI, so keep the root window from appearing
## Show an "Open" dialog box and return the path to the selected file
filenames = filedialog.askopenfilename(title='Open data file', filetypes=(("Comma delimited", "*.csv"),), multiple=True)
TkWindow.destroy()
if len(filenames) == 0:
    print('No files selected - Exiting program.')
    sys.exit()
else:
    print('\n'.join(filenames))

## Read the data from the specified file/s
print('\nReading data file/s')
dfs = []
for filename in filenames:
    dfs.append(import_file(filename))
if len(dfs) > 1:
    print('\nCombining data files.')
The file is filled with NUL, '\x00', which needs to be removed.
Use pandas.DataFrame to load the data from d, after the rows have been cleaned.
import pandas as pd
import string  # to make column names

# the issue is that the file is filled with NUL, not whitespace
def import_file(filename):
    # open the file and clean it
    with open(filename) as f:
        d = list(f.readlines())
    # replace NUL, strip whitespace from the end of the strings, split each string into a list
    d = [v.replace('\x00', '').strip().split(',') for v in d]
    # remove some empty rows
    d = [v for v in d if len(v) > 2]
    # load the file with pandas
    df = pd.DataFrame(d)
    # convert column 0 and 1 to a datetime
    df['datetime'] = pd.to_datetime(df[0] + ' ' + df[1])
    # drop column 0 and 1
    df.drop(columns=[0, 1], inplace=True)
    # set datetime as the index
    df.set_index('datetime', inplace=True)
    # convert data in columns to floats
    df = df.astype('float')
    # give character column names
    df.columns = list(string.ascii_uppercase)[:len(df.columns)]
    return df.copy()

# call the function
dfs = list()
filenames = ['67.csv']
for filename in filenames:
    dfs.append(import_file(filename))

display(dfs[0])
A B C D E F G H I J K L M N O
datetime
2020-02-03 15:13:39 5.5 5.8 42.84 7.2 6.8 10.63 60.0 0.0 300.0 1.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:13:49 5.5 5.8 42.84 7.2 6.8 10.63 60.0 0.0 300.0 1.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:13:59 5.5 5.7 34.26 7.2 6.8 10.63 60.0 22.3 300.0 1.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:14:09 5.5 5.7 34.26 7.2 6.8 10.63 60.0 15.3 300.0 45.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:14:19 5.5 5.4 17.10 7.2 6.8 10.63 60.0 50.2 300.0 86.0 30.0 79.0 0.0 0.0 0.0
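If you would rather keep using read_csv, another option (not part of the answer above) is to strip the NUL bytes yourself and feed the cleaned text back through io.StringIO. This is only a sketch: it assumes the apparent leading whitespace is entirely NUL padding, that the only non-data line is the 'Data from ...' title row, and a pandas version that still accepts a nested parse_dates list.
import io
import pandas as pd

def import_file(filename):
    # read the raw text and remove the NUL padding before handing it to pandas
    with open(filename) as f:
        cleaned = f.read().replace('\x00', '')
    # skiprows=1 drops the title line; columns 0 and 1 are combined into one datetime column
    return pd.read_csv(io.StringIO(cleaned), header=None, skiprows=1,
                       skipinitialspace=True, parse_dates=[[0, 1]], dayfirst=True)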
I have a matrix of the form:
movie_id 1 2 3 ... 1494 1497 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 1.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0
. ...
.
.
As you can see, even though there are 1500 movies in my dataset, some movies haven't been recorded because of the preprocessing my data has gone through.
What I want is to add all the columns (movie_ids) that haven't been recorded and fill them with 0 (I don't know exactly which movie_ids are missing). So, for example, I want a new matrix of the form:
movie_id 1 2 3 ... 1494 1495 1496 1497 1498 1499 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
. ...
.
.
Use DataFrame.reindex along axis=1 with fill_value=0 to conform the dataframe columns to a new index range:
df = df.reindex(range(df.columns.min(), df.columns.max() + 1), axis=1, fill_value=0)
Result:
movie_id  1    2    3    ...  1498  1499  1500
user_id
1600      1.0  0.0  1.0  ...     0     0   1.0
1601      1.0  0.0  0.0  ...     0     0   0.0
1602      0.0  0.0  0.0  ...     0     0   1.0
1603      0.0  0.0  1.0  ...     0     0   0.0
1604      1.0  0.0  0.0  ...     0     0   0.0
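Note that df.columns.min() and df.columns.max() only span the ids that are already present, so if movie 1 or movie 1500 is itself missing the range will fall short. A sketch of the same call against the full expected range (1 to 1500, as stated in the question):
# add an all-zero column for every missing movie_id between 1 and 1500
df = df.reindex(columns=range(1, 1501), fill_value=0)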
I assume the variable name of the matrix is matrix:
n_movies = 1500
movie_ids = matrix.columns
for movie_id in range(1, n_movies + 1):
    # iterate over ids
    if movie_id not in movie_ids:
        # if there's no such movie, create a column filled with zeros
        matrix[movie_id] = 0
I can't work out why this code is dropping values
solddf[['Name', 'Barcode', 'SalesRank', 'SoldPrices', 'SoldDates', 'SoldIds']].head()
Out[3]:
Name Barcode \
62693 Near Dark [DVD] [1988] [Region 1] [US Import] ... 1.313124e+10
94823 Battlefield 2 Modern Combat / Game 1.463315e+10
24965 Star Wars: The Force Unleashed (PS3) 2.327201e+10
24964 Star Wars: The Force Unleashed (PS3) 2.327201e+10
24963 Star Wars: The Force Unleashed (PS3) 2.327201e+10
SalesRank SoldPrices SoldDates SoldIds
62693 14.04 2017-08-05 07:28:56 162558627930
94823 1.49 2017-09-06 04:48:42 132301267483
24965 4.29 2017-08-23 18:44:42 302424166550
24964 5.27 2017-09-08 19:55:02 132317908530
24963 5.56 2017-09-15 08:23:24 132322978130
Here's my dataframe. It stores each sale I pull from an eBay API as a new row.
My aim is to look for a correlation between weekly sales and Amazon's Sales Rank.
solddf['Week'] = solddf['SoldDates'].apply(lambda x: x.week)
weeklysales = solddf.groupby(['Barcode', 'Week']).size().unstack()
weeklysales = weeklysales.fillna(0)
weeklysales['Mean'] = weeklysales.mean(axis=1)
weeklysales.head()
Out[5]:
Week 29 30 31 32 33 34 35 36 37 38 39 40 41 \
Barcode
1.313124e+10 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.463315e+10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2.327201e+10 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 2.0 2.0 0.0 2.0 1.0
2.327201e+10 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2.327201e+10 0.0 0.0 3.0 2.0 2.0 2.0 1.0 1.0 5.0 0.0 2.0 2.0 1.0
Week 42 Mean
Barcode
1.313124e+10 0.0 0.071429
1.463315e+10 0.0 0.071429
2.327201e+10 0.0 0.642857
2.327201e+10 0.0 0.142857
2.327201e+10 0.0 1.500000
So, I've worked out the mean weekly sales for each item (or barcode).
I then want to take the mean values and insert them back into the solddf dataframe I started with.
s1 = pd.Series(weeklysales.Mean, index=solddf.Barcode).reset_index()
s1 = s1.sort_values('Barcode')
s1.head()
Out[17]:
Barcode Mean
0 1.313124e+10 0.071429
1 1.463315e+10 0.071429
2 2.327201e+10 0.642857
3 2.327201e+10 0.642857
4 2.327201e+10 0.642857
This is looking fine: it has the right number of rows and should fit.
solddf = solddf.sort_values('Barcode')
solddf['WeeklySales'] = s1.Mean
This method seems to work, but I'm having an issue: some np.nan values have now appeared which weren't in s1 before.
s1.Mean.isnull().sum()
Out[13]: 0
len(s1) == len(solddf)
Out[14]: True
But loads of my values that have passed across are now np.nan
solddf.WeeklySales.isnull().sum()
Out[16]: 27214
Can anyone tell me why?
While writing this I had an idea for a workaround:
s1list = s1.Mean.tolist()
solddf['WeeklySales'] = s1list
solddf.WeeklySales.isnull().sum()
Out[20]: 0
Still curious what the problem with the previous method is though!
The NaN values appear because column assignment aligns on index labels: after reset_index, s1 has a fresh 0..n-1 RangeIndex, while solddf keeps its original index, so only the labels the two happen to share receive values. Instead of trying to align the two indices and insert the new column, you should just use pd.merge.
output = pd.merge(solddf, s1, on='Barcode')
This way you can also choose which type of join you want, using the how kwarg.
I would also advise reading Merge, join, and concatenate as it covers a lot of helpful methods for combining dataframes.
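For example (a sketch using the names from the question), deduplicating s1 first keeps the join one-to-one on Barcode, and a left join preserves every row of solddf:
# one Mean per Barcode, then left-join it onto the sales rows
s1_unique = s1.drop_duplicates('Barcode')
solddf = pd.merge(solddf, s1_unique, on='Barcode', how='left')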