I've been trying to use the pd.merge function properly, but I either receive an error or get the table formatted in a way I don't like. I looked through the documentation but can't find a way to merge only a specific column. For instance, let's say I'm working with these two dataframes.
df_1 = county_name accidents pedestrians
ADAMS 1 2
ALLEGHENY 1 3
ARMSTRONG 3 4
BEDFORD 1 1
df_2 = county_name population
ADAMS 102336
ALLEGHENY 1223048
ARMSTRONG 65642
BEDFORD 166140
BERKS 48480
BLAIR 417854
BRADFORD 123457
BUCKS 60853
CAMBRIA 628341
The outcome I'm looking for is something like this, where the county names are added to the 'county_name' column but not duplicated, and the 'population' column is left off.
df_outcome = county_name accidents pedestrians
ADAMS 1 2
ALLEGHENY 1 3
ARMSTRONG 3 4
BEDFORD 1 1
BERKS NaN NaN
BLAIR NaN NaN
BRADFORD NaN NaN
BUCKS NaN NaN
CAMBRIA NaN NaN
Lastly, I plan to use df_outcome.fillna(0) to replace all the NaN values with zero.
Filter the county_name column and use merge with a left join:
df = df_2[['county_name']].merge(df_1, how='left')
print (df)
county_name accidents pedestrians
0 ADAMS 1.0 2.0
1 ALLEGHENY 1.0 3.0
2 ARMSTRONG 3.0 4.0
3 BEDFORD 1.0 1.0
4 BERKS NaN NaN
5 BLAIR NaN NaN
6 BRADFORD NaN NaN
7 BUCKS NaN NaN
8 CAMBRIA NaN NaN
Try:
df = pd.merge(df_1, df_2[['county_name']], how='right')
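Either way, the fillna(0) step the asker plans can be chained straight onto the merge; a minimal sketch using the frames above:
import pandas as pd

# keep every county from df_2, pull in df_1's counts, then zero-fill the gaps
df_outcome = df_2[['county_name']].merge(df_1, how='left').fillna(0)
print(df_outcome)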
I have 2 tables
df:
date        James  Jamie  John  Allysia  Jean
2022-01-01  NaN    6      5     4        3
2022-01-02  7      6      7     NaN      5

groupings:
names    groupings
James    guy
John     guy
Jamie    girl
Allysia  girl
Jean     girl

into:
date        James  Jamie  John  Allysia  Jean  girl  guy
2022-01-01  NaN    6      5     4        3     5     5
2022-01-02  7      6      7     NaN      5     5.5   7
threshold: > 3
I want to create new columns grouped by the guy/girl scores, where only scores above the threshold are taken, and get their mean while ignoring NaN and scores that do not meet the threshold. I do not know how to replace scores below the threshold with NaN. I tried a groupby to get the names into lists and create new columns with the mean:
groupingseries = groupings.groupby(['groupings'])['names'].apply(list)
for k, s in zip(groupingseries.keys(), groupingseries):
    try:
        its = '"' + ',"'.join(s) + '"'
        df[k] = df[s].mean()
    except:
        print('not in item')
Not sure why the results return NaN for girl and guy.
Please do help.
Assuming df and groupings are your two input DataFrames:
out = df.join(
    df.groupby(df.columns.map(groupings.set_index('names')['groupings']), axis=1).sum()
)
Output:
date James Jamie John Allysia Jean girl guy
0 2022-01-01 NaN 6 5 4.0 3 13.0 5.0
1 2022-01-02 7.0 6 7 NaN 5 11.0 14.0
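Note the join above sums every score and ignores the threshold; to match the asker's stated goal (mean of only the scores above 3, skipping NaN), here is one possible sketch, assuming the same df and groupings frames:
mapping = groupings.set_index('names')['groupings']
scores = df.drop(columns='date')
above = scores.where(scores > 3)  # scores at or below the threshold become NaN
# group the transposed frame (one row per person) by girl/guy, average, transpose back
means = above.T.groupby(above.T.index.map(mapping)).mean().T
out = pd.concat([df, means], axis=1)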
I am struggling with the following issue.
My DF is:
df = pd.DataFrame(
    [
        ['7890-1', '12345N', 'John', 'Intermediate'],
        ['7890-4', '30909N', 'Greg', 'Intermediate'],
        ['3300-1', '88117N', 'Mark', 'Advanced'],
        ['2502-2', '90288N', 'Olivia', 'Elementary'],
        ['7890-2', '22345N', 'Joe', 'Intermediate'],
        ['7890-3', '72245N', 'Ana', 'Elementary'],
    ],
    columns=['Id', 'Code', 'Person', 'Level'])
print(df)
I would like to get a result like this:
|    |   Id | Code 1 | Person 1 | Level 1      | Code 2 | Person 2 | Level 2      | Code 3 | Person 3 | Level 3    | Code 4 | Person 4 | Level 4      |
|---:|-----:|:-------|:---------|:-------------|:-------|:---------|:-------------|:-------|:---------|:-----------|:-------|:---------|:-------------|
|  0 | 7890 | 12345N | John     | Intermediate | 22345N | Joe      | Intermediate | 72245N | Ana      | Elementary | 30909N | Greg     | Intermediate |
|  1 | 3300 | 88117N | Mark     | Advanced     | NaN    | NaN      | NaN          | NaN    | NaN      | NaN        | NaN    | NaN      | NaN          |
|  2 | 2502 | NaN    | NaN      | NaN          | 90288N | Olivia   | Elementary   | NaN    | NaN      | NaN        | NaN    | NaN      | NaN          |
I'd start with the same approach as @Andrej Kesely, but then sort by index after unstacking and map ' '.join over the column names.
df[["Id", "No"]] = df["Id"].str.split("-", expand=True)
df_wide = df.set_index(["Id", "No"]).unstack(level=1).sort_index(axis=1,level=1)
df_wide.columns = df_wide.columns.map(' '.join)
Output:
Code 1 Level 1 Person 1 Code 2 Level 2 Person 2 Code 3 \
Id
2502 NaN NaN NaN 90288N Elementary Olivia NaN
3300 88117N Advanced Mark NaN NaN NaN NaN
7890 12345N Intermediate John 22345N Intermediate Joe 72245N
Level 3 Person 3 Code 4 Level 4 Person 4
Id
2502 NaN NaN NaN NaN NaN
3300 NaN NaN NaN NaN NaN
7890 Elementary Ana 30909N Intermediate Greg
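One wrinkle: unstack sorts Level before Person alphabetically, so the columns come out as Code/Level/Person rather than the Code/Person/Level order in the asker's example. If that order matters, a small reorder sketch (my addition, not part of the answer above):
order = {'Code': 0, 'Person': 1, 'Level': 2}
cols = sorted(df_wide.columns, key=lambda c: (int(c.split()[1]), order[c.split()[0]]))
df_wide = df_wide[cols]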
Try:
df[["Id", "Id2"]] = df["Id"].str.split("-", expand=True)
x = df.set_index(["Id", "Id2"]).unstack(level=1)
x.columns = [f"{a} {b}" for a, b in x.columns]
print(
    x[sorted(x.columns, key=lambda k: int(k.split()[-1]))]
    .reset_index()
    .to_markdown()
)
Prints:
|    |   Id | Code 1 | Person 1 | Level 1      | Code 2 | Person 2 | Level 2      | Code 3 | Person 3 | Level 3    | Code 4 | Person 4 | Level 4      |
|---:|-----:|:-------|:---------|:-------------|:-------|:---------|:-------------|:-------|:---------|:-----------|:-------|:---------|:-------------|
|  0 | 2502 | nan    | nan      | nan          | 90288N | Olivia   | Elementary   | nan    | nan      | nan        | nan    | nan      | nan          |
|  1 | 3300 | 88117N | Mark     | Advanced     | nan    | nan      | nan          | nan    | nan      | nan        | nan    | nan      | nan          |
|  2 | 7890 | 12345N | John     | Intermediate | 22345N | Joe      | Intermediate | 72245N | Ana      | Elementary | 30909N | Greg     | Intermediate |
Given this DataFrame, I want to summarize each column's missing-value count plus its mean, median, and mode:
Name Sex Age Ticket_No Fare
0 Braund male 22 HN07681 2500
1 NaN female 42 HN05681 6895
2 peter male NaN KKSN55 800
3 NaN male 56 HN07681 2500
4 Daisy female 22 hf55s44 NaN
5 Manson NaN 48 HN07681 8564
6 Piston male NaN HN07681 5622
7 Racline female 42 Nh55146 NaN
8 NaN male 22 HN07681 4875
9 NaN NaN NaN NaN NaN
The expected output is:
col_Name No_of_Missing Mean Median Mode
0 Name 3 NaN NaN NaN
1 Sex 1 NaN NaN NaN
2 Age 2 36 42 22
3 Fare 2 4536 4875 2500
Mean/Median/Mode should be computed only for numeric dtypes; otherwise the value should be null.
Try this:
import numpy as np
import pandas as pd

# Your original df
print(df)

# First drop any rows which are completely NaN
df = df.dropna(how="all")

# Create a list to hold other lists.
# This will be used as the data for the new dataframe
new_data = []

# Parse through the columns
for col in df.columns:
    # Create a new list, which will be one row of data in the new dataframe,
    # the first item containing only the column's name,
    # to correspond with the new df's first column
    _list = [col]
    _list.append(df.dtypes[col])  # dtype for that column is the second item/second column
    missing = df[col].isna().sum()  # Total the number of "NaN" in the column
    if missing > 30:
        print("Max total number of missing exceeded")
        continue  # Skip this column and continue on to the next column
    # Get the mean; fall back to NaN if it's not possible
    _list.append(missing)
    try:
        mean = df[col].mean()
    except TypeError:
        mean = np.nan
    _list.append(mean)  # Append to the proper column position
    # Get the median; fall back to NaN if it's not possible
    try:
        median = df[col].median()
    except TypeError:
        median = np.nan
    _list.append(median)
    # Get the mode; fall back to NaN if it's not possible
    try:
        mode = df[col].mode()[1]
    except (TypeError, KeyError):
        mode = np.nan
    _list.append(mode)
    new_data.append(_list)

columns = ["col_Name", "DType", "No_of_Missing", "Mean", "Median", "Mode"]
new_df = pd.DataFrame(new_data, columns=columns)
print("============================")
print(new_df)
OUTPUT:
Name Sex Age Ticket_No Fare
0 Braund male 22.0 HN07681 2500.0
1 NaN female 42.0 HN05681 6895.0
2 peter male NaN KKSN55 800.0
3 NaN male 56.0 HN07681 2500.0
4 Daisy female 22.0 hf55s44 NaN
5 Manson NaN 48.0 HN07681 8564.0
6 Piston male NaN HN07681 5622.0
7 Racline female 42.0 Nh55146 NaN
8 NaN male 22.0 HN07681 4875.0
9 NaN NaN NaN NaN NaN
============================
col_Name DType No_of_Missing Mean Median Mode
0 Name object 3 NaN NaN Daisy
1 Sex object 1 NaN NaN NaN
2 Age float64 2 36.285714 42.0 NaN
3 Ticket_No object 0 NaN NaN NaN
4 Fare float64 2 4536.571429 4875.0 NaN
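For comparison, a more compact sketch using pandas built-ins that also follows the asker's numeric-only rule (my addition, not the answerer's code):
import pandas as pd

df = df.dropna(how="all")          # drop fully-empty rows first
num = df.select_dtypes("number")   # mean/median/mode only for numeric columns
summary = pd.DataFrame({
    "No_of_Missing": df.isna().sum(),
    "Mean": num.mean(),
    "Median": num.median(),
    "Mode": num.mode().iloc[0],    # first mode; non-numeric columns stay NaN
}).rename_axis("col_Name").reset_index()
print(summary)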
I have a dataFrame that looks like the following:
page_id content name
1 {} John
1 {cat, dog} Anne
2 {} Ethan
3 {} John
3 {sea, earth} Anne
3 {earth, green} Ethan
4 {} Mark
I need the value of the content column in each row to be replaced with the content value of the next row, but only within the same page_id. I suppose I need to use the shift() function along with a groupby on page_id, but I don't know how to put it together.
The expected output would be:
page_id content name
1 {cat, dog} John
1 NaN Anne
2 NaN Ethan
3 {sea, earth} John
3 {earth, green} Anne
3 NaN Ethan
4 NaN Mark
Any help on this issue will be much appreciated.
Looks like you want a groupby with shift:
df['content'] = df.groupby('page_id').content.apply(lambda x: x.shift(-1))
page_id content
0 1.0 {cat, dog}
1 NaN NaN
2 NaN NaN
3 3.0 {earth, sea}
4 3.0 {green, earth}
5 NaN NaN
6 NaN NaN
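In recent pandas versions the apply isn't needed, since shift is available directly on the groupby (same result, assuming the frame above):
df['content'] = df.groupby('page_id')['content'].shift(-1)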
You can avoid the groupby apply, given your data is sorted on 'page_id': shift everything, then keep values only where the row below belongs to the same group, using where. This will be much faster as the number of groups becomes large.
df['content'] = df.content.shift(-1).where(df.page_id.eq(df.page_id.shift(-1)))
page_id content name
0 1 {cat, dog} John
1 1 NaN Anne
2 2 NaN Ethan
3 3 {earth, sea} John
4 3 {earth, green} Anne
5 3 NaN Ethan
6 4 NaN Mark
I have a pandas dataframe with a column named 'City, State, Country'. I want to separate this column into three new columns, 'City, 'State' and 'Country'.
0 HUN
1 ESP
2 GBR
3 ESP
4 FRA
5 ID, USA
6 GA, USA
7 Hoboken, NJ, USA
8 NJ, USA
9 AUS
Splitting the column into three columns is trivial enough:
location_df = df['City, State, Country'].apply(lambda x: pd.Series(x.split(',')))
However, this creates left-aligned data:
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 ID USA NaN
6 GA USA NaN
7 Hoboken NJ USA
8 NJ USA NaN
9 AUS NaN NaN
How would one go about creating the new columns with the data right-aligned? Would I need to iterate through every row, count the number of commas and handle the contents individually?
I'd do something like the following:
foo = lambda x: pd.Series([i for i in reversed(x.split(','))])
rev = df['City, State, Country'].apply(foo)
print(rev)
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 USA ID NaN
6 USA GA NaN
7 USA NJ Hoboken
8 USA NJ NaN
9 AUS NaN NaN
I think that gets you what you want, but if you also want to pretty things up and get a City, State, Country column order, you could add the following:
rev.rename(columns={0: 'Country', 1: 'State', 2: 'City'}, inplace=True)
rev = rev[['City', 'State', 'Country']]
print(rev)
City State Country
0 NaN NaN HUN
1 NaN NaN ESP
2 NaN NaN GBR
3 NaN NaN ESP
4 NaN NaN FRA
5 NaN ID USA
6 NaN GA USA
7 Hoboken NJ USA
8 NaN NJ USA
9 NaN NaN AUS
Assuming your column is named target:
df[["City", "State", "Country"]] = df["target"].str.split(pat=",", expand=True)
Since you are dealing with strings, I would suggest this amendment to your current code:
location_df = df['City, State, Country'].apply(lambda x: pd.Series(str(x).split(',')))
I got mine to work by testing one of the columns, so give this one a try.
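For a fully worked variant of the reversed-split idea with whitespace stripping and padding handled (a sketch of my own, assuming no missing values in the column):
parts = df['City, State, Country'].str.split(',')
# reverse each row's parts so Country always lands in the first column,
# letting DataFrame construction pad shorter rows with NaN on the right
rev = pd.DataFrame(parts.map(lambda p: [s.strip() for s in reversed(p)]).tolist(),
                   index=df.index)
rev.columns = ['Country', 'State', 'City'][:rev.shape[1]]
rev = rev.reindex(columns=['City', 'State', 'Country'])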