I have two dataframes like the ones sampled below. I'm trying to append the records from the second dataframe to the bottom of the first, so the final dataframe should only have two columns. Instead I seem to be appending the columns of one dataframe onto the right side of the first. Does anyone see what I'm doing wrong?
Code:
appendDf=df1.append(df2)
df1
    28343  \
0   42267
1  157180
2  186320

   https://s.m.com/is/ime/M/ts/mized/5_fpx.tif
0                    https://sl.com/is/i/M/...
1                    https://sl.com/is/i/M/...
2                   https://sl.com/is/im/M/...
df2
   454  \
0  223
1  155
2  334

   https://s.m.com/is/ime/M/ts/mized/5.tif
0             https://slret.com/is/i/M/...
1            https://slfdsd.com/is/i/M/...
2             https://slfd.com/is/im/M/...
appendDf.head()
28343 https://s.m.com/is/ime/M/ts/mized/5_fpx.tif 454 https://s.m.com/is/ime/M/ts/mized/5.tif
Your DataFrames do not seem to have column headers (I imagine the first row of your data is being used as the column headers), which is likely the root of your issue. When you append the second DataFrame, pandas doesn't know which columns its data correspond to, so it adds them as new columns. See the following example:
import pandas as pd

df1 = pd.DataFrame([[28343, 'http://link1'], [42267, 'http://link2'],
                    [157180, 'http://link3'], [186320, 'http://link4']],
                   columns=['ID', 'Link'])
df2 = pd.DataFrame([[454, 'http://link5'], [223, 'http://link6'],
                    [155, 'http://link7'], [334, 'http://link8']])
appendedDF = df1.append(df2)
Yields:
ID Link 0 1
0 28343.0 http://link1 NaN NaN
1 42267.0 http://link2 NaN NaN
2 157180.0 http://link3 NaN NaN
3 186320.0 http://link4 NaN NaN
0 NaN NaN 454.0 http://link5
1 NaN NaN 223.0 http://link6
2 NaN NaN 155.0 http://link7
3 NaN NaN 334.0 http://link8
Correct implementation:
import pandas as pd

df1 = pd.DataFrame([[28343, 'http://link1'], [42267, 'http://link2'],
                    [157180, 'http://link3'], [186320, 'http://link4']],
                   columns=['ID', 'Link'])
df2 = pd.DataFrame([[454, 'http://link5'], [223, 'http://link6'],
                    [155, 'http://link7'], [334, 'http://link8']],
                   columns=['ID', 'Link'])
appendedDF = df1.append(df2).reset_index(drop=True)
Yields:
ID Link
0 28343 http://link1
1 42267 http://link2
2 157180 http://link3
3 186320 http://link4
4 454 http://link5
5 223 http://link6
6 155 http://link7
7 334 http://link8
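A side note if you are on a recent pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same row-wise append is spelled with pd.concat. A minimal sketch using the frames above:

import pandas as pd

df1 = pd.DataFrame([[28343, 'http://link1'], [42267, 'http://link2']],
                   columns=['ID', 'Link'])
df2 = pd.DataFrame([[454, 'http://link5'], [223, 'http://link6']],
                   columns=['ID', 'Link'])

# concat stacks rows by matching column names; ignore_index renumbers 0..n-1,
# replacing the old df1.append(df2).reset_index(drop=True) idiom
appendedDF = pd.concat([df1, df2], ignore_index=True)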
Related
I have a Pandas DataFrame with several columns.
One of these ('Code') is object-type but has missing data (NaN). Other data can be numbers or letters.
For the missing data, I want to do a map / set_index function in order to fill in the data.
Here is my code:
for row in df['Code']:
    if pd.isnull(row) == True:
        df['Code'] = df['account'].map(df_2.set_index('AccountID')['AccountCode'])
    else:
        None
However, this code deletes all data from the entire column.
This is the original column (I mean to do the map on the NaN values only!):
0 23050178040
1 23050178040
2 23050178040
3 23050178106
4 23050178040
...
288 23050942326
289 23050942326
290 NaN
291 23050942858
292 NaN
Name: Code BU, Length: 293, dtype: object
And the result:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
288 NaN
289 NaN
290 NaN
291 NaN
292 NaN
Name: Code BU, Length: 293, dtype: object
What is the issue here?
Instead of looping, use Series.fillna. The assignment inside your loop replaces the whole 'Code' column with the mapped values on every iteration, which is why the non-NaN entries get wiped out too; fillna only touches the missing positions:
df['Code']= df['Code'].fillna(df['account'].map(df_2.set_index('AccountID')['AccountCode']))
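A minimal sketch of why this works, with hypothetical stand-ins for your df and df_2:

import pandas as pd

# hypothetical sample data standing in for df and df_2
df = pd.DataFrame({'account': ['A1', 'A2', 'A3'],
                   'Code': ['23050178040', None, None]})
df_2 = pd.DataFrame({'AccountID': ['A2', 'A3'],
                     'AccountCode': ['23050942326', '23050942858']})

# map builds a Series of codes looked up by account; fillna copies them
# only into the NaN slots, leaving existing codes untouched
df['Code'] = df['Code'].fillna(
    df['account'].map(df_2.set_index('AccountID')['AccountCode']))
print(df)
#   account         Code
# 0      A1  23050178040
# 1      A2  23050942326
# 2      A3  23050942858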
I am importing an excel worksheet using pandas and trying to remove any instance where there is a duplicate area measurement for a given Frame. The sheet I'm playing with looks vaguely like the table below wherein there are n number of files, a measured area from each frame of an individual file, and the Frame Number that corresponds to each area measurement.
Filename.0          Area.0  Frame.0  Filename.1          Area.1  Frame.1  ...  Filename.n          Area.n  Frame.n
Exp327_Date_File_0  600     1        Exp327_Date_File_1  830     1        ...  Exp327_Date_File_n  700     1
Exp327_Date_File_0  270     2        Exp327_Date_File_1  730     1        ...  Exp327_Date_File_n  600     2
Exp327_Date_File_0  230     3        Exp327_Date_File_1  630     2        ...  Exp327_Date_File_n  500     3
Exp327_Date_File_0  200     4        Exp327_Date_File_1  530     3        ...  Exp327_Date_File_n  400     4
NaN                 NaN     NaN      Exp327_Date_File_1  430     4        ...  NaN                 NaN     NaN
If I manually go through the excel worksheet and concatenate the filenames into just 3 unique columns containing my entire dataset like so:
Filename            Area    Frame
Exp327_Date_File_0  600     1
Exp327_Date_File_0  270     2
etc...              etc...  etc...
Exp327_Date_File_n  530     4
I have been able to successfully use pandas to remove the duplicates using the following:
df_1 = df.groupby(['Filename', 'Frame Number']).agg({'Area': 'sum'})
However, manually concatenating everything into this format isn't feasible when I have hundreds of File replicates and I will then have to separate everything back out into multiple column-sets (similar to how the data is presented in Table 1). How do I either (1) use pandas to create a new Dataframe with every 3 columns stacked on top of each other which I can then group and aggregate before breaking back up into individual sets of columns based on Filename or (2) loop through the multiple filenames and aggregate any Frames with multiple Areas? I have tried option 2:
(row, col) = df.shape  # shape of the data frame the excel file was read into
for count in range(0, round(col/3)):  # iterate through the data
    aggregation_functions = {'Area.'+str(count): 'sum'}  # add Areas together
    df_2.groupby(['Filename.'+str(count), 'Frame Number.'+str(count)]).agg(aggregation_functions)
However, this just returns the same DataFrame without any of the Areas summed together. Any help would be appreciated, and please let me know if my question is unclear!
Here is a way to achieve option (1):
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame({'Filename.0': ['Exp327_Date_File_0', 'Exp327_Date_File_0',
                                  'Exp327_Date_File_0', 'Exp327_Date_File_0',
                                  np.NaN],
                   'Area.0': [600, 270, 230, 200, np.NaN],
                   'Frame.0': [1, 2, 3, 4, np.NaN],
                   'Filename.1': ['Exp327_Date_File_1', 'Exp327_Date_File_1',
                                  'Exp327_Date_File_1', 'Exp327_Date_File_1',
                                  'Exp327_Date_File_1'],
                   'Area.1': [830, 730, 630, 530, 430],
                   'Frame.1': [1, 1, 2, 3, 4],
                   'Filename.2': ['Exp327_Date_File_2', 'Exp327_Date_File_2',
                                  'Exp327_Date_File_2', 'Exp327_Date_File_2',
                                  'Exp327_Date_File_2'],
                   'Area.2': [700, 600, 500, 400, np.NaN],
                   'Frame.2': [1, 2, 3, 4, np.NaN]})

# create list of sub-dataframes, each with 3 columns, partitioning the original dataframe
subframes = [df.iloc[:, j:(j + 3)] for j in np.arange(len(df.columns), step=3)]

# set column names to the same values for each subframe
for subframe in subframes:
    subframe.columns = ['Filename', 'Area', 'Frame']

# concatenate the subframes
df_long = pd.concat(subframes)
df_long
Filename Area Frame
0 Exp327_Date_File_0 600.0 1.0
1 Exp327_Date_File_0 270.0 2.0
2 Exp327_Date_File_0 230.0 3.0
3 Exp327_Date_File_0 200.0 4.0
4 NaN NaN NaN
0 Exp327_Date_File_1 830.0 1.0
1 Exp327_Date_File_1 730.0 1.0
2 Exp327_Date_File_1 630.0 2.0
3 Exp327_Date_File_1 530.0 3.0
4 Exp327_Date_File_1 430.0 4.0
0 Exp327_Date_File_2 700.0 1.0
1 Exp327_Date_File_2 600.0 2.0
2 Exp327_Date_File_2 500.0 3.0
3 Exp327_Date_File_2 400.0 4.0
4 Exp327_Date_File_2 NaN NaN
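From df_long, the aggregation step from the question should now work in one shot; a sketch (dropping the all-NaN padding rows first):

# drop the padding rows, then sum any duplicate (Filename, Frame) areas
df_agg = (df_long.dropna(subset=['Filename'])
                 .groupby(['Filename', 'Frame'], as_index=False)
                 .agg({'Area': 'sum'}))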
I have two dataframes as follows
transactions
buy_date buy_price
0 2018-04-16 33.23
1 2018-05-09 33.51
2 2018-07-03 32.74
3 2018-08-02 33.68
4 2019-04-03 33.58
and
cii
from_fy to_fy score
0 2001-04-01 2002-03-31 100
1 2002-04-01 2003-03-31 105
2 2003-04-01 2004-03-31 109
3 2004-04-01 2005-03-31 113
4 2005-04-01 2006-03-31 117
In the transactions dataframe I need to create a new column cii_score based on the following condition:
if transactions['buy_date'] is between cii['from_fy'] and cii['to_fy'], take the cii['score'] value for transactions['cii_score']
I have tried a list comprehension but it was no good.
Request your inputs on how to tackle this.
First, we set up your dfs. Note that I modified the dates in transactions in this short example to make it more interesting:
import numpy as np
import pandas as pd
from io import StringIO
trans_data = StringIO(
"""
,buy_date,buy_price
0,2001-04-16,33.23
1,2001-05-09,33.51
2,2002-07-03,32.74
3,2003-08-02,33.68
4,2003-04-03,33.58
"""
)
cii_data = StringIO(
"""
,from_fy,to_fy,score
0,2001-04-01,2002-03-31,100
1,2002-04-01,2003-03-31,105
2,2003-04-01,2004-03-31,109
3,2004-04-01,2005-03-31,113
4,2005-04-01,2006-03-31,117
"""
)
tr_df = pd.read_csv(trans_data, index_col = 0)
tr_df['buy_date'] = pd.to_datetime(tr_df['buy_date'])
cii_df = pd.read_csv(cii_data, index_col = 0)
cii_df['from_fy'] = pd.to_datetime(cii_df['from_fy'])
cii_df['to_fy'] = pd.to_datetime(cii_df['to_fy'])
The main thing is the following calculation: for each row of tr_df, find the index of the row in cii_df that satisfies the condition. The following computes this match; each element of the list is the matching row index in cii_df:
match = [[(f <= d) & (d <= e) for f, e in zip(cii_df['from_fy'], cii_df['to_fy'])].index(True)
         for d in tr_df['buy_date']]
match
produces
[0, 0, 1, 2, 2]
now we can merge on this
tr_df.merge(cii_df, left_on = np.array(match), right_index = True)
so that we get
key_0 buy_date buy_price from_fy to_fy score
0 0 2001-04-16 33.23 2001-04-01 2002-03-31 100
1 0 2001-05-09 33.51 2001-04-01 2002-03-31 100
2 1 2002-07-03 32.74 2002-04-01 2003-03-31 105
3 2 2003-08-02 33.68 2003-04-01 2004-03-31 109
4 2 2003-04-03 33.58 2003-04-01 2004-03-31 109
and the score column is what you asked for.
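As an aside, if the fiscal-year ranges in cii never overlap, pandas can do the interval lookup directly; a sketch (a date falling outside every range comes back as position -1, so guard for that if it can happen):

# one closed interval per fiscal year; get_indexer finds which interval
# each buy_date falls into
intervals = pd.IntervalIndex.from_arrays(cii_df['from_fy'], cii_df['to_fy'],
                                         closed='both')
tr_df['cii_score'] = cii_df['score'].to_numpy()[intervals.get_indexer(tr_df['buy_date'])]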
I am trying to calculate the mean of all previous rows for each column of the DataFrame and add the calculated mean columns to the DataFrame.
I am using a set of NBA games data that contains 20+ features (columns) that I am trying to calculate the means for. An example of the dataset is below. (Note: "...." represents the rest of the feature columns.)
Team TeamPoints OpponentPoints.... TeamPoints_mean OpponentPoints_mean
ATL 102 109 .... nan nan
ATL 102 92 .... 102 109
ATL 92 94 .... 102 100.5
BOS 119 122 .... 98.67 98.33
BOS 103 96 .... 103.75 104.25
Example for calculating two of the columns:
dataset = pd.read_csv('nba.games.stats.csv')
df = dataset
df['TeamPoints_mean'] = df.groupby('Team')['TeamPoints'].apply(lambda x: x.shift().expanding().mean())
df['OpponentPoints_mean'] = df.groupby('Team')['OpponentPoints'].apply(lambda x: x.shift().expanding().mean())
Again, this code only calculates the mean and adds the column to the DataFrame one at a time. Is there a way to get the column means and add them to the DataFrame without doing it one at a time? A for loop? An example of what I am looking for is below.
Team TeamPoints OpponentPoints.... TeamPoints_mean OpponentPoints_mean ...("..." = mean columns of rest of the feature columns)
ATL 102 109 .... nan nan
ATL 102 92 .... 102 109
ATL 92 94 .... 102 100.5
BOS 119 122 .... 98.67 98.33
BOS 103 96 .... 103.75 104.25
Try this one:
(0) sample input:
>>> df
col1 col2 col3
0 1.490977 1.784433 0.852842
1 3.726663 2.845369 7.766797
2 0.042541 1.196383 6.568839
3 4.784911 0.444671 8.019933
4 3.831556 0.902672 0.198920
5 3.672763 2.236639 1.528215
6 0.792616 2.604049 0.373296
7 2.281992 2.563639 1.500008
8 4.096861 0.598854 4.934116
9 3.632607 1.502801 0.241920
Then processing:
(1) side table to get all the means on the side (I didn't find a cumulative mean function, so went with cumsum + count)
>>> df_side=df.assign(col_temp=1).cumsum()
>>> df_side
col1 col2 col3 col_temp
0 1.490977 1.784433 0.852842 1.0
1 5.217640 4.629801 8.619638 2.0
2 5.260182 5.826184 15.188477 3.0
3 10.045093 6.270855 23.208410 4.0
4 13.876649 7.173527 23.407330 5.0
5 17.549412 9.410166 24.935545 6.0
6 18.342028 12.014215 25.308841 7.0
7 20.624021 14.577855 26.808849 8.0
8 24.720882 15.176708 31.742965 9.0
9 28.353489 16.679509 31.984885 10.0
>>> for el in df.columns:
...     df_side["{}_mean".format(el)] = df_side[el] / df_side.col_temp
...
>>> df_side = df_side.drop([el for el in df.columns] + ["col_temp"], axis=1)
>>> df_side
col1_mean col2_mean col3_mean
0 1.490977 1.784433 0.852842
1 2.608820 2.314901 4.309819
2 1.753394 1.942061 5.062826
3 2.511273 1.567714 5.802103
4 2.775330 1.434705 4.681466
5 2.924902 1.568361 4.155924
6 2.620290 1.716316 3.615549
7 2.578003 1.822232 3.351106
8 2.746765 1.686301 3.526996
9 2.835349 1.667951 3.198489
(2) joining back, on index:
>>> df_final=df.join(df_side)
>>> df_final
col1 col2 col3 col1_mean col2_mean col3_mean
0 1.490977 1.784433 0.852842 1.490977 1.784433 0.852842
1 3.726663 2.845369 7.766797 2.608820 2.314901 4.309819
2 0.042541 1.196383 6.568839 1.753394 1.942061 5.062826
3 4.784911 0.444671 8.019933 2.511273 1.567714 5.802103
4 3.831556 0.902672 0.198920 2.775330 1.434705 4.681466
5 3.672763 2.236639 1.528215 2.924902 1.568361 4.155924
6 0.792616 2.604049 0.373296 2.620290 1.716316 3.615549
7 2.281992 2.563639 1.500008 2.578003 1.822232 3.351106
8 4.096861 0.598854 4.934116 2.746765 1.686301 3.526996
9 3.632607 1.502801 0.241920 2.835349 1.667951 3.198489
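For what it's worth, the same running means can likely be had in one line with expanding, under the same all-numeric-columns assumption (untested sketch):

# expanding().mean() is the mean of rows 0..i per column, matching the
# cumsum / count construction above
df_final = df.join(df.expanding().mean().add_suffix('_mean'))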
I am trying to calculate the means of all previous rows for each column of the DataFrame
To get all of the columns, you can do:
df_means = df.join(df.cumsum() /
                   df.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
However, if Team is a column rather than the index, you'd want to get rid of it:
df_data = df.drop('Team', axis=1)
df_means = df.join(df_data.cumsum() /
                   df_data.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
You could also do
import numpy as np
df_data = df[[col for col in df.columns
              if np.issubdtype(df[col].dtype, np.number)]]
Or manually define a list of columns that you want to take the mean of, cols_for_mean, and then do
df_data = df[cols_for_mean]
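pandas also ships a helper for exactly that filter; select_dtypes should be equivalent to the issubdtype comprehension:

# keep only the numeric columns before computing the running means
df_data = df.select_dtypes(include='number')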
The Overview:
In our project, we are working with a CSV file that contains some data. We will call it smal.csv. It is a bit of a chunky file that will later be used for some other algorithms. (Here is the gist in case the link to smal.csv is too badly formatted for your browser.)
The file will be loaded like this
filename = "smal.csv"
keyname = "someKeyname"
self.data[keyname] = spectral_data(pd.read_csv(filename, header=[0, 1], verbose=True))
The spectral_data class looks like this. As you can see, we do not actually keep the dataframe as is.
class spectral_data(object):
    def __init__(self, df):
        try:
            uppercols = df.columns.levels[0]
            lowercols = list(df.columns.levels[1].values)
        except:
            df.columns = pd.MultiIndex.from_tuples(list(df.columns))
            uppercols = df.columns.levels[0]
            lowercols = list(df.columns.levels[1].values)
        for i, val in enumerate(lowercols):
            try:
                lowercols[i] = float(val)
            except:
                lowercols[i] = val
        levels = [uppercols, lowercols]
        df.columns.set_levels(levels, inplace=True)
        self.df = df
After we've loaded it we'd like to concatenate it with another set of data, also loaded like smal.csv was.
Our concatenation is done like this.
new_df = pd.concat([self.data[dataSet1].df, self.data[dataSet2].df], ignore_index=True)
However, the ignore_index=True does not work, because the running sample number we want reset lives in a data column, not in the index. We also cannot simply remove that column; it is necessary for other parts of our program.
The Objective:
I'm trying to concatenate a couple of data frames together; however, what I thought was the index is not actually the index of the data frame. Thus the command
pd.concat([df1.df, df2.df], ignore_index=True)
will not work. I thought maybe using iloc to change each individual cell would work, but that doesn't feel like the most intuitive way to approach this.
How can I get a data frame that looks like this
[396 rows x 6207 columns]
Unnamed: 0_level_0 meta ... wvl
Unnamed: 0_level_1 Sample ... 932.695 932.89
0 1 NaN ... -12.33 9.67
1 2 NaN ... 11.94 3.94
2 3 NaN ... -2.67 28.33
3 4 NaN ... 53.22 -13.78
4 1 NaN ... 43.28
5 2 NaN ... 41.33 47.33
6 3 NaN ... -21.94 12.06
7 4 NaN ... -30.94 -1.94
8 5 NaN ... -24.78 40.22
Turn into this.
[396 rows x 6207 columns]
Unnamed: 0_level_0 meta ... wvl
Unnamed: 0_level_1 Sample ... 932.695 932.89
0 1 NaN ... -12.33 9.67
1 2 NaN ... 11.94 3.94
2 3 NaN ... -2.67 28.33
3 4 NaN ... 53.22 -13.78
4 5 NaN ... 43.28
5 6 NaN ... 41.33 47.33
6 7 NaN ... -21.94 12.06
7 8 NaN ... -30.94 -1.94
8 9 NaN ... -24.78 40.22
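One hedged way to get there: concatenate as before, then overwrite the sample-number column with a fresh sequence, since ignore_index only resets the real index. The MultiIndex key below is read off the printed header and is an assumption; adjust it to whatever your first column is actually called:

import pandas as pd

new_df = pd.concat([self.data[dataSet1].df, self.data[dataSet2].df],
                   ignore_index=True)

# the sample number is an ordinary data column, so renumber it by hand
sample_col = ('Unnamed: 0_level_0', 'Unnamed: 0_level_1')  # assumed key
new_df[sample_col] = range(1, len(new_df) + 1)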