The data, simplified, looks like this:
mon site year data1 data2
1 57598 2001 58 1383
2 57598 2001 75 549
1 57598 2002 118 1337
2 57598 2002 162 2213
1 50136 2000 -282 134
2 50136 2000 -242 0
1 50136 2001 -126 102
1 50844 2000 152 411
2 50844 2000 70 117
1 50844 2002 -74 44
2 50844 2002 -173 83
I want to extract data1 and data2 and reshape them into the following form.
This is data1:
2000 2000 2001 2001 2002 2002
1 2 1 2 1 2
50136 -282 -242 -126 NA NA NA
50844 152 70 NA NA -74 -173
57598 58 75 NA NA 118 162
data2 should be saved as a new file in the same form as data1.
I want to use pandas groupby for this, but the following code raises an error:
df['data1'].groupby(df['year'],df['mon'],df['site'])
Is this easy to do with groupby?
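For reference, a sketch of loading the sample above into a DataFrame, assuming whitespace-separated text as shown:
import io
import pandas as pd

data = '''mon site year data1 data2
1 57598 2001 58 1383
2 57598 2001 75 549
1 57598 2002 118 1337
2 57598 2002 162 2213
1 50136 2000 -282 134
2 50136 2000 -242 0
1 50136 2001 -126 102
1 50844 2000 152 411
2 50844 2000 70 117
1 50844 2002 -74 44
2 50844 2002 -173 83'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')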
I think the best first try is set_index with unstack:
df1 = df.set_index(['year','mon','site'])['data1'].unstack(level=[0,1]).sort_index(axis=1)
print (df1)
year 2000 2001 2002
mon 1 2 1 2 1 2
site
50136 -282.0 -242.0 -126.0 NaN NaN NaN
50844 152.0 70.0 NaN NaN -74.0 -173.0
57598 NaN NaN 58.0 75.0 118.0 162.0
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
then use another solution with groupby or pivot_table:
You can use groupby with unstack:
df1 = df.groupby(['year','mon','site'])['data1'].mean().unstack(level=[0,1])
print (df1)
year 2000 2001 2002
mon 1 2 1 2 1 2
site
50136 -282.0 -242.0 -126.0 NaN NaN NaN
50844 152.0 70.0 NaN NaN -74.0 -173.0
57598 NaN NaN 58.0 75.0 118.0 162.0
Another possible solution is pivot_table; its default aggfunc is np.mean, but it can be changed to other functions, e.g. aggfunc='sum':
import numpy as np

print (df.pivot_table(index='site', columns=['year','mon'], values='data1', aggfunc=np.mean))
year 2000 2001 2002
mon 1 2 1 2 1 2
site
50136 -282.0 -242.0 -126.0 NaN NaN NaN
50844 152.0 70.0 NaN NaN -74.0 -173.0
57598 NaN NaN 58.0 75.0 118.0 162.0
Last, use DataFrame.to_csv to write the result to a CSV file.
df1.to_csv('file_out.csv')
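The question also asks for data2 in the same layout; the same pattern applies with the other column selected (a sketch, with a hypothetical output filename):
df2 = df.set_index(['year','mon','site'])['data2'].unstack(level=[0,1]).sort_index(axis=1)
df2.to_csv('file_out2.csv')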
To get the df in the shape in which you need it:
result = df.groupby(['site','mon','year'])['data1'].mean().unstack().unstack()
Out[310]:
year 2000 2001 2002
mon 1 2 1 2 1 2
site
50136 -282.0 -242.0 -126.0 NaN NaN NaN
50844 152.0 70.0 NaN NaN -74.0 -173.0
57598 NaN NaN 58.0 75.0 118.0 162.0
To save it to csv:
df.groupby(['site','mon','year'])['data1'].mean().unstack().unstack().to_csv('data1.csv')
df.groupby(['site','mon','year'])['data2'].mean().unstack().unstack().to_csv('data2.csv')
Related
First of all, sorry for the bad title. I will illustrate better here. I have a dataframe such as this:
level 1  level 2  qty 2  level 3  qty 3  level 4  qty 4
1980     2302     1.2    nan      nan    nan      nan
1980     7117     2.4    10025    15     2343     11
1980     7117     2.4    1221     1.3    nan      nan
1870     2333     22     nan      nan    nan      nan
1870     7117     2.1    10025    12     nan      nan
1870     7117     2.1    5445     11     nan      nan
It is a flattened hierarchy that describes which components go into a product: level 1 is the finished good (e.g. pizza), and levels 2, 3, and so on are the ingredients used to make it. I need to apply the following logic:
df_grouped = df.groupby(by=['level 1'])
levels = [4, 3, 2]  # renamed from `range` to avoid shadowing the built-in
for name, group in df_grouped:
    for i in levels:
        df[f'qty {i}'] = df[f'qty {i-1}'] * df[f'qty {i}'] / group.groupby(f'level {i-1}')[f'qty {i}'].transform('sum')
OK, so here is what I need to do. If we, for instance, look at level 1 = 1980 and level 2 = 7117, I need to take 2.4 * 15/(15 + 1.3) for qty 3. The same goes for the row below: 2.4 * 1.3/(15 + 1.3).
This needs to be done for each level of each level 1 (product).
expected output:
level 1  level 2  qty 2  level 3  qty 3          level 4  qty 4
1980     2302     1.2    nan      nan             nan      nan
1980     7117     2.4    10025    2.20858895706   2343     15
1980     7117     2.4    1221     0.19141104294   nan      nan
1870     2333     22     nan      nan             nan      nan
1870     7117     2.1    10025    1.09565217391   nan      nan
1870     7117     2.1    5445     1.00434782609   nan      nan
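For what it's worth, a minimal sketch of this logic under two assumptions: each qty i is normalized by the sum of its siblings within the ['level 1', ..., 'level i-1'] group, and deeper levels are processed first so the original parent quantity is used (this reproduces the expected numbers above; there is no qty 1, so the loop stops at level 3):
import numpy as np
import pandas as pd

# The sample frame from the question.
df = pd.DataFrame({
    'level 1': [1980, 1980, 1980, 1870, 1870, 1870],
    'level 2': [2302, 7117, 7117, 2333, 7117, 7117],
    'qty 2':   [1.2, 2.4, 2.4, 22.0, 2.1, 2.1],
    'level 3': [np.nan, 10025, 1221, np.nan, 10025, 5445],
    'qty 3':   [np.nan, 15.0, 1.3, np.nan, 12.0, 11.0],
    'level 4': [np.nan, 2343, np.nan, np.nan, np.nan, np.nan],
    'qty 4':   [np.nan, 11.0, np.nan, np.nan, np.nan, np.nan],
})

# Deepest level first, so each 'qty i' is scaled by the *original*
# quantity of its parent level.
for i in (4, 3):
    group_cols = ['level 1'] + [f'level {j}' for j in range(2, i)]
    sibling_sum = df.groupby(group_cols)[f'qty {i}'].transform('sum')
    df[f'qty {i}'] = df[f'qty {i-1}'] * df[f'qty {i}'] / sibling_sum

print(df)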
I have two dataframes
dt AAPL AMC AMZN ASO ATH ... SPCE SRNE TH TSLA VIAC WKHS
0 2021-04-12 36 28 6 20 1 ... 5 0 0 50 23 0
1 2021-04-13 46 15 5 16 6 ... 5 0 0 122 12 1
2 2021-04-14 12 4 1 5 2 ... 2 0 0 39 1 0
3 2021-04-15 30 23 3 14 2 ... 15 0 0 101 9 0
dt AAPL AMC AMZN ASO ATH ... SPCE SRNE TH TSLA VIAC WKHS
0 2021-04-12 41 28 4 33 10 ... 5 0 0 56 14 3
1 2021-04-13 76 22 7 12 29 ... 4 0 0 134 8 2
2 2021-04-14 21 15 2 7 16 ... 2 0 0 61 3 0
3 2021-04-15 54 43 9 2 31 ... 16 0 0 83 13 1
I want to remove values lower than 10 from both dataframes: if a value is removed from one dataframe, the same cell should be removed from the other, and vice versa.
I'd appreciate your help.
Use a mask:
## pre-requisite
df1 = df1.set_index('dt')
df2 = df2.set_index('dt')
## processing
mask = df1.lt(10) | df2.lt(10)
df1 = df1.mask(mask)
df2 = df2.mask(mask)
output:
>>> df1
AAPL AMC AMZN ASO ATH SPCE SRNE TH TSLA VIAC WKHS
dt
2021-04-12 36 28.0 NaN 20.0 NaN NaN NaN NaN 50 23.0 NaN
2021-04-13 46 15.0 NaN 16.0 NaN NaN NaN NaN 122 NaN NaN
2021-04-14 12 NaN NaN NaN NaN NaN NaN NaN 39 NaN NaN
2021-04-15 30 23.0 NaN NaN NaN 15.0 NaN NaN 101 NaN NaN
>>> df2
AAPL AMC AMZN ASO ATH SPCE SRNE TH TSLA VIAC WKHS
dt
2021-04-12 41 28.0 NaN 33.0 NaN NaN NaN NaN 56 14.0 NaN
2021-04-13 76 22.0 NaN 12.0 NaN NaN NaN NaN 134 NaN NaN
2021-04-14 21 NaN NaN NaN NaN NaN NaN NaN 61 NaN NaN
2021-04-15 54 43.0 NaN NaN NaN 16.0 NaN NaN 83 NaN NaN
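Equivalently, DataFrame.where with the inverted mask keeps only the values where the condition holds (a sketch):
df1 = df1.where(~mask)
df2 = df2.where(~mask)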
I'm working with a large dataset and have the following issue:
Let's say I'm measuring the input of a substance ("sub-input") into a medium ("id"). For each sub-input I have calculated the year in which it is going to reach the other side of the medium ("y-arrival"). Sometimes several sub-inputs arrive in the same year, and sometimes no substance arrives in a given year.
Example:
import pandas as pd
import numpy as np
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year= [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
in1 = [20,40,10,30,50,80,
60,10,10,40,np.NaN,np.NaN,
np.NaN,120,30,70,60,90]
arr = [2002,2004,2004,2004,2005,np.NaN,
1991,1992,np.NaN,1995,1995,np.NaN,
2001,2002,2004,2004,2005,np.NaN]
dictex3 ={"id":ids,"year":year,"sub-input":in1, "y-arrival":arr}
dfex3 = pd.DataFrame(dictex3)
I have then calculated the sum of "sub-input" for each "y-arrival" using the following code:
dfex3["input_sum_tf"] = dfex3.groupby(["id","y-arrival"])["sub-input"].transform(sum)
print(dfex3)
id year sub-input y-arrival input_sum_tf
0 1 2000 20.0 2002.0 20.0
1 1 2001 40.0 2004.0 80.0
2 1 2002 10.0 2004.0 80.0
3 1 2003 30.0 2004.0 80.0
4 1 2004 50.0 2005.0 50.0
5 1 2005 80.0 NaN NaN
6 2 1990 60.0 1991.0 60.0
7 2 1991 10.0 1992.0 10.0
8 2 1992 10.0 NaN NaN
9 2 1993 40.0 1995.0 40.0
10 2 1994 NaN 1995.0 40.0
11 2 1995 NaN NaN NaN
12 3 2000 NaN 2001.0 0.0
13 3 2001 120.0 2002.0 120.0
14 3 2002 30.0 2004.0 100.0
15 3 2003 70.0 2004.0 100.0
16 3 2004 60.0 2005.0 60.0
17 3 2005 90.0 NaN NaN
Now, for each "id", the sum of the inputs that reach the destination in each "y-arrival" year has been calculated.
The goal is to reorder these values so that for each id and each year, the sum of the sub-inputs that will arrive in that year can be shown. Example:
id = 1, year = 2000 --> no y-arrival = 2000 --> = NaN
id = 1, year = 2001 --> no y-arrival = 2001 --> = NaN
id = 1, year = 2002 --> y-arrival = 2002 has an input_sum_tf = 20 --> = 20
id = 1, year = 2003 --> no y-arrival = 2003 --> = NaN
id = 1, year = 2004 --> y-arrival = 2004 has an input_sum_tf = 80 --> = 80
The "input_sum_tf" is the sum of the substances that arrive in a given year. The value "80" for year 2004 is the sum of the sub-input from the years 2001, 2002, 2003 because all of these arrive in year 2004 (y-arrival = 2004).
The result ("input_sum") should look like this:
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 50.0
6 NaN
7 60.0
8 10.0
9 NaN
10 NaN
11 40.0
12 NaN
13 NaN
14 120.0
15 NaN
16 100.0
17 60.0
My approach:
I tried solving this by using the pandas merge function on two columns, but the result isn't quite right. So far my code only produces correct values for the first 5 rows.
dfex3['input_sum'] = dfex3.merge(dfex3, left_on=['id','y-arrival'],
right_on=['id','year'],
how='right')['input_sum_tf_x']
dfex3["input_sum"]
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 80.0
6 80.0
7 50.0
8 NaN
9 60.0
10 10.0
11 NaN
12 NaN
13 40.0
14 40.0
15 NaN
16 0.0
17 120.0
Any help would be much appreciated!
The issue is that your code is trying to merge on 'year' and 'y-arrival', so it's making multiple matches when you only want one match. E.g. row 4, where year=2004, will match 3 times where y-arrival=2004 (rows 1-3), hence the duplicates of 80 in output rows 4-6.
Use groupby to get the last row for each id/y-arrival combo (also looks like you don't want matches where 'input_sum_tf' is zero):
df_last = dfex3.groupby(['id', 'y-arrival']).last().reset_index()
df_last = df_last[df_last['input_sum_tf'] != 0]
Then merge:
dfex3.merge(df_last,
left_on=['id', 'year'],
right_on=['id', 'y-arrival'],
how='left')['input_sum_tf_y']
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 50.0
6 NaN
7 60.0
8 10.0
9 NaN
10 NaN
11 40.0
12 NaN
13 NaN
14 120.0
15 NaN
16 100.0
17 60.0
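If you want to store the result as a column, as in the original attempt, the merged Series can be assigned back (a sketch; the left merge keeps dfex3's 18 rows in order, so the indexes line up):
dfex3['input_sum'] = dfex3.merge(df_last,
                                 left_on=['id', 'year'],
                                 right_on=['id', 'y-arrival'],
                                 how='left')['input_sum_tf_y']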
I am writing code in pandas and am stuck on the part below, where I need to fill in missing rows.
df
A B C D E F G H
0 US BENIN 1995 5 10 15 40
1 US BENIN 1996 6 12 12 12
2 US BENIN 2000 4 13 12 12
3 US Hungary 1998 5 19 23 23
4 US Hungary 1999 3 23 12 3
5 UK Chile 2000 5 10 15 40
6 UK Chile 2002 6 12 12 12
7 UK Chile 2004 4 13 12 12
8 UK Iceland 2004 5 19 23 23
89 UK Iceland 2005 3 23 12 3
I want to add blank rows for the missing years from 1995 to 2005 between these rows, using a loop.
Desired output:
A B C D E F G H
0 US BENIN 1995 5 10 15 40
1 US BENIN 1996 6 12 12 12
2 US BENIN 1997
3 US BENIN 1998
4 US BENIN 1999
5 US BENIN 2000 4 13 12 12
6 US BENIN 2001
7 US BENIN 2002
8 US BENIN 2003
9 US BENIN 2004
10 US BENIN 2005
11 US Hungary 1995
12 US Hungary 1996
13 US Hungary 1997
14 US Hungary 1998 5 19 23 23
15 US Hungary 1999 3 23 12 3
16 US Hungary 2000
17 US Hungary 2001
18 US Hungary 2002
19 US Hungary 2003
20 US Hungary 2004
21 US Hungary 2005
22 UK Chile 1995
23 UK Chile 1996
24 UK Chile 1997
25 UK Chile 1998
26 UK Chile 1999
27 UK Chile 2000 5 10 15 40
28 UK Chile 2001
29 UK Chile 2002 6 12 12 12
30 UK Chile 2003
31 UK Chile 2004 4 13 12 12
32 UK Chile 2005
:
:
42 UK Iceland 2004 5 19 23 23
43 UK Iceland 2005 3 23 12 3
New Solution:
import re
import pandas as pd

# Rebuild the sample frame by parsing the rows shown in the question.
# (Values are parsed as strings; column D is converted to int below.)
df: pd.DataFrame = pd.DataFrame([
    re.match(r'(\w+) +(\w+) +(\w+) +(\w+) +(\w+) +(\w+) +(\w+) +(\w+)', data).groups() for data in '''
A B C D E F G H
0 US BENIN 1995 5 10 15 40
1 US BENIN 1996 6 12 12 12
2 US BENIN 2000 4 13 12 12
3 US Hungary 1998 5 19 23 23
4 US Hungary 1999 3 23 12 3
5 UK Chile 2000 5 10 15 40
6 UK Chile 2002 6 12 12 12
7 UK Chile 2004 4 13 12 12
8 UK Iceland 2004 5 19 23 23
89 UK Iceland 2005 3 23 12 3
'''.split('\n')[1:-1]
])

def consolidate(index, year_min, year_max):
    # Expand an iterable of (country, county, year) tuples so that every
    # year from year_min to year_max appears exactly once.
    indexes: list = []
    last_country, last_county, last_year = None, None, year_min
    for country, county, year in index:
        for yr in range(last_year, year):
            indexes.append((country, county, yr))
        last_country, last_county, last_year = country, county, year
    if last_year <= year_max:
        for yr in range(last_year, year_max + 1):
            indexes.append((last_country, last_county, yr))
    return indexes

df.columns = df.iloc[0, :]  # the first parsed row holds the column names
df = df.iloc[1:, :]
df['D'] = df['D'].astype(int)  # years must be numeric for the range logic
df = df.sort_values(['D', 'C'])
year_min, year_max = df.D.min(), df.D.max()
df.set_index(['B', 'C', 'D'], inplace=True)
df1 = df.groupby(['B', 'C']).apply(lambda x: x.reindex(consolidate(x.index, year_min, year_max)))
df1.index = df1.index.droplevel([0, 1])
df = df1.reset_index()

if __name__ == '__main__':
    print(df)
# 0 B C D A E F G H
# 0 UK Chile 1995 NaN NaN NaN NaN NaN
# 1 UK Chile 1996 NaN NaN NaN NaN NaN
# 2 UK Chile 1997 NaN NaN NaN NaN NaN
# 3 UK Chile 1998 NaN NaN NaN NaN NaN
# 4 UK Chile 1999 NaN NaN NaN NaN NaN
# 5 UK Chile 2000 5 5 10 15 40
# 6 UK Chile 2001 NaN NaN NaN NaN NaN
# 7 UK Chile 2002 6 6 12 12 12
# 8 UK Chile 2003 NaN NaN NaN NaN NaN
# 9 UK Chile 2004 7 4 13 12 12
# 10 UK Chile 2005 NaN NaN NaN NaN NaN
# 11 UK Iceland 1995 NaN NaN NaN NaN NaN
# 12 UK Iceland 1996 NaN NaN NaN NaN NaN
# 13 UK Iceland 1997 NaN NaN NaN NaN NaN
# 14 UK Iceland 1998 NaN NaN NaN NaN NaN
# 15 UK Iceland 1999 NaN NaN NaN NaN NaN
# 16 UK Iceland 2000 NaN NaN NaN NaN NaN
# 17 UK Iceland 2001 NaN NaN NaN NaN NaN
# 18 UK Iceland 2002 NaN NaN NaN NaN NaN
# 19 UK Iceland 2003 NaN NaN NaN NaN NaN
# 20 UK Iceland 2004 8 5 19 23 23
# 21 UK Iceland 2005 89 3 23 12 3
# 22 US BENIN 1995 0 5 10 15 40
# 23 US BENIN 1996 1 6 12 12 12
# 24 US BENIN 1997 NaN NaN NaN NaN NaN
# 25 US BENIN 1998 NaN NaN NaN NaN NaN
# 26 US BENIN 1999 NaN NaN NaN NaN NaN
# 27 US BENIN 2000 2 4 13 12 12
# 28 US BENIN 2001 NaN NaN NaN NaN NaN
# 29 US BENIN 2002 NaN NaN NaN NaN NaN
# 30 US BENIN 2003 NaN NaN NaN NaN NaN
# 31 US BENIN 2004 NaN NaN NaN NaN NaN
# 32 US BENIN 2005 NaN NaN NaN NaN NaN
# 33 US Hungary 1995 NaN NaN NaN NaN NaN
# 34 US Hungary 1996 NaN NaN NaN NaN NaN
# 35 US Hungary 1997 NaN NaN NaN NaN NaN
# 36 US Hungary 1998 3 5 19 23 23
# 37 US Hungary 1999 4 3 23 12 3
# 38 US Hungary 2000 NaN NaN NaN NaN NaN
# 39 US Hungary 2001 NaN NaN NaN NaN NaN
# 40 US Hungary 2002 NaN NaN NaN NaN NaN
# 41 US Hungary 2003 NaN NaN NaN NaN NaN
# 42 US Hungary 2004 NaN NaN NaN NaN NaN
# 43 US Hungary 2005 NaN NaN NaN NaN NaN
I'm not sure how optimal this is, but since you asked for details on this approach, here it is.
Assuming you have the data you want to add in a DataFrame like this:
print(df_to_add)
# B C D
# 0 US Hungary 1995
# 1 US Hungary 1996
# 2 US Hungary 1997
# 3 US Hungary 1998
# 4 US Hungary 1999
# .
# .
# .
# 39 UK Iceland 2001
# 40 UK Iceland 2002
# 41 UK Iceland 2003
# 42 UK Iceland 2004
# 43 UK Iceland 2005
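For reference, one way such a df_to_add could be built (a sketch, assuming every (B, C) pair should cover the years 1995-2005, and pandas >= 1.2 for how='cross'):
pairs = df[['B', 'C']].drop_duplicates()
years = pd.DataFrame({'D': list(range(1995, 2006))})
df_to_add = pairs.merge(years, how='cross')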
And your data in a DataFrame:
print(df.head())
# A B C D E F G H
# 0 0 US BENIN 1995 5 10 15 40
# 1 1 US BENIN 1996 6 12 12 12
# 2 2 US BENIN 2000 4 13 12 12
# 3 3 US Hungary 1998 5 19 23 23
# 4 4 US Hungary 1999 3 23 12 3
This should do what you requested:
# Concatenate the data
df = pd.concat([df, df_to_add])
df = df.sort_values(by=['B', 'C', 'D']).reset_index(drop=True)
df['A'] = df.index.values
# remove nan duplicates
def filter_dup_nans(df: pd.DataFrame, row):
    # if it's a duplicate
    if df[(df['B'] == row['B']) & (df['C'] == row['C']) & (df['D'] == row['D'])].shape[0] > 1:
        # return False if it's the NaN one
        return not row.isnull().values.any()
    # not a duplicate -> don't remove it
    return True

to_remove = [idx for idx, row in df.iterrows() if not filter_dup_nans(df, row)]
df = df.drop(to_remove).reset_index(drop=True)
df['A'] = df.index.values
print(df)
Prints:
A B C D E F G H
0 0 UK Chile 1995 NaN NaN NaN NaN
1 1 UK Chile 1996 NaN NaN NaN NaN
2 2 UK Chile 1997 NaN NaN NaN NaN
3 3 UK Chile 1998 NaN NaN NaN NaN
4 4 UK Chile 1999 NaN NaN NaN NaN
5 5 UK Chile 2000 5.0 10.0 15.0 40.0
6 6 UK Chile 2001 NaN NaN NaN NaN
7 7 UK Chile 2002 6.0 12.0 12.0 12.0
8 8 UK Chile 2003 NaN NaN NaN NaN
9 9 UK Chile 2004 4.0 13.0 12.0 12.0
10 10 UK Chile 2005 NaN NaN NaN NaN
11 11 UK Iceland 1995 NaN NaN NaN NaN
12 12 UK Iceland 1996 NaN NaN NaN NaN
13 13 UK Iceland 1997 NaN NaN NaN NaN
14 14 UK Iceland 1998 NaN NaN NaN NaN
15 15 UK Iceland 1999 NaN NaN NaN NaN
16 16 UK Iceland 2000 NaN NaN NaN NaN
17 17 UK Iceland 2001 NaN NaN NaN NaN
18 18 UK Iceland 2002 NaN NaN NaN NaN
19 19 UK Iceland 2003 NaN NaN NaN NaN
20 20 UK Iceland 2004 5.0 19.0 23.0 23.0
21 21 UK Iceland 2005 3.0 23.0 12.0 3.0
22 22 US BENIN 1995 5.0 10.0 15.0 40.0
23 23 US BENIN 1996 6.0 12.0 12.0 12.0
24 24 US BENIN 1997 NaN NaN NaN NaN
25 25 US BENIN 1998 NaN NaN NaN NaN
26 26 US BENIN 1999 NaN NaN NaN NaN
27 27 US BENIN 2000 4.0 13.0 12.0 12.0
28 28 US BENIN 2001 NaN NaN NaN NaN
29 29 US BENIN 2002 NaN NaN NaN NaN
30 30 US BENIN 2003 NaN NaN NaN NaN
31 31 US BENIN 2004 NaN NaN NaN NaN
32 32 US BENIN 2005 NaN NaN NaN NaN
33 33 US Hungary 1995 NaN NaN NaN NaN
34 34 US Hungary 1996 NaN NaN NaN NaN
35 35 US Hungary 1997 NaN NaN NaN NaN
36 36 US Hungary 1998 5.0 19.0 23.0 23.0
37 37 US Hungary 1999 3.0 23.0 12.0 3.0
38 38 US Hungary 2000 NaN NaN NaN NaN
39 39 US Hungary 2001 NaN NaN NaN NaN
40 40 US Hungary 2002 NaN NaN NaN NaN
41 41 US Hungary 2003 NaN NaN NaN NaN
42 42 US Hungary 2004 NaN NaN NaN NaN
43 43 US Hungary 2005 NaN NaN NaN NaN
I have a dataframe (df) that I break down into 4 new dfs (media, client, code_type, and date). media has one column of null values, while the other three are 1-dim dfs consisting entirely of nulls. After replacing the nulls in each dataframe, I try pd.concat to get a single df, and get the result below.
code_type
0 P
1 P
2 P
3 P
4 P
5 P
code_name media_type acq. revenue
0 RASH NaN 50.0 34004.0
1 100 NaN 10.0 1035.0
2 NEWS NaN 61.0 3475.0
3 DR NaN 53.0 4307.0
4 SPORTS NaN 45.0 6503.0
5 DOUBL NaN 13.0 4205.0
client_id
0 2.0
1 2.0
2 2.0
3 2.0
4 2.0
5 2.0
date
0 2016-08-15
1 2016-08-15
2 2016-08-15
3 2016-08-15
4 2016-08-15
5 2016-08-15
I pd.merge media with a separate df to fill in the NaNs under media.media_type, which appends a new media_type_y column:
code_name media_type_x acq. revenue media_type_y
0 RASH NaN 282 34004.0 Radio
1 100 NaN 119 1035.0 NaN
2 NEWS NaN 81 3475.0 SiriusXM
3 DR NaN 33 4307.0 SiriusXM
4 SPORTS NaN 25 6503.0 SiriusXM
5 DOUBL NaN 23 4205.0 Podcast
I then drop media_type_x and rename media_type_y to just media_type
final = m.loc[:,('code_name','media_type_y', 'acquisition', 'revenue')]
final = final.rename(columns={'media_type_y': 'media_type'})
So that when I concatenate, I have a complete df.
clean = pd.concat([media, client, code_type, date], axis=1)
code media acq. revenue client code_type date
0 RASH Radio 50.0 34004.0 NaN NaN NaT
1 100 NaN 10.0 1035.0 NaN NaN NaT
2 NEWS SiriusXM 61.0 3475.0 NaN NaN NaT
3 DR SiriusXM 53.0 4307.0 NaN NaN NaT
4 SPORTS SiriusXM 45.0 6503.0 NaN NaN NaT
5 DOUBL Podcast 13.0 4205.0 NaN NaN NaT
clean.client is supposed to be all 2
clean.code_type should be all P
clean.date should be all 08/15/2016
The dfs by themselves show the data; it's only when I concatenate that I lose the information. I think it may be something with the indexes, but I'm not sure. It could also have to do with the fact that one column mixes str and int (see clean.code above), which might be why I get the runtime warning listed below.
//anaconda/lib/python3.5/site-packages/pandas/indexes/api.py:71: RuntimeWarning: unorderable types: int() < str(), sort order is undefined for incomparable objects
result = result.union(other)
Starting with this:
code_name media_type acq. revenue
0 RASH Radio 50.0 34004.0
1 100 NaN 10.0 1035.0
2 NEWS SiriusXM 61.0 3475.0
3 DR SiriusXM 53.0 4307.0
4 SPORTS SiriusXM 45.0 6503.0
5 DOUBL Podcast 13.0 4205.0
Try this:
df['client_id'] = 2
df['date'] = '08/15/2016'
df['code_type'] = 'P'
df
code_name media_type acq. revenue client_id date code_type
0 RASH Radio 50.0 34004.0 2 08/15/2016 P
1 100 NaN 10.0 1035.0 2 08/15/2016 P
2 NEWS SiriusXM 61.0 3475.0 2 08/15/2016 P
3 DR SiriusXM 53.0 4307.0 2 08/15/2016 P
4 SPORTS SiriusXM 45.0 6503.0 2 08/15/2016 P
5 DOUBL Podcast 13.0 4205.0 2 08/15/2016 P
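If the concat approach is still needed, the NaN/NaT rows are most likely an index-alignment problem: pd.concat with axis=1 aligns rows by index label, and the four frames apparently no longer share the same index after the earlier merge. Resetting the indexes first should line the rows up positionally (a sketch, assuming the frames share row order):
frames = [media, client, code_type, date]
clean = pd.concat([f.reset_index(drop=True) for f in frames], axis=1)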