How do I combine dataframe columns - python

I have a dataframe df that looks like:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 810 entries, 0 to 809
Data columns (total 21 columns):
event_type 810 non-null object
datetime 810 non-null datetime64[ns]
person 810 non-null object
...
from_file 0 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(2), object(16)
memory usage: 133.0+ KB
(There are 21 columns but I'm only interested in the four above, so I've omitted the rest.)
I want to create a second dataframe df_b with two columns: one is a combination of df's event_type, person, and from_file columns, and the other is df's datetime. (So df_b gets two columns out of df's four, with three of them combined into one.)
I thought of creating a new dataframe df_b as:
df_b = pandas.DataFrame({'event_type+person+from_file': [], 'datetime': []})
Then selecting all rows with:
df.loc[:, ['event_type','person','from_file','datetime']]
But beyond that I don't know how to achieve the rest, and I'm worried I'll end up with datetime values that don't correspond to the original row's datetime pulled from df.
So can you show me how to:
select: event_type, person, from_file, datetime from all rows in df
combine: event_type, person, from_file with '+' between the values
and then put (event_type+person+from_file), datetime into df_b
?

To drop NaN values use:
df_clean = df.dropna(subset=['event_type', 'person', 'from_file'])
Concatenating string columns in pandas is as easy as:
df_clean['event_type+person+from_file'] = (df_clean['event_type'] + '+' +
                                           df_clean['person'] + '+' +
                                           df_clean['from_file'].astype(str))
(Note the .astype(str): from_file is float64, so it must be converted before string concatenation, and the parentheses keep the multi-line expression valid.)
And then:
df_b = df_clean[['event_type+person+from_file', 'datetime']].copy()
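Putting it all together, a minimal end-to-end sketch (assuming df is as described above; because df_b is built row-for-row from df_clean, each datetime stays aligned with the values it came from):
import pandas as pd

# keep the four columns of interest, then drop rows missing any of the three parts
df_clean = df.loc[:, ['event_type', 'person', 'from_file', 'datetime']].dropna(
    subset=['event_type', 'person', 'from_file'])

df_b = pd.DataFrame({
    'event_type+person+from_file': (df_clean['event_type'] + '+'
                                    + df_clean['person'] + '+'
                                    + df_clean['from_file'].astype(str)),
    'datetime': df_clean['datetime'],
})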

Related

A merge in pandas is returning only NaN values

I'm trying to merge two dataframes: 'new_df' and 'df3'.
new_df contains years and months, and df3 contains years, months and other columns.
I've cast most of the columns as object, and tried to merge them both.
The merge 'works' in that it doesn't return an error, but my final dataframe is almost all empty; only the year and month columns are correct.
new_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_test 119 non-null datetime64[ns]
1 year 119 non-null object
2 month 119 non-null object
dtypes: datetime64[ns](1), object(2)
df3
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 53 to 1297
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 case_number 191 non-null object
1 date 191 non-null object
2 year 191 non-null object
3 country 191 non-null object
4 area 191 non-null object
5 location 191 non-null object
6 activity 191 non-null object
7 fatal_y_n 182 non-null object
8 time 172 non-null object
9 species 103 non-null object
10 month 190 non-null object
dtypes: object(11)
I've tried this line of code:
df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year','month'])
I was expecting a table with all columns filled in; instead everything except year and month came back as NaN.
Your issue is with the data types for month and year in both frames: they're of type object, which means values that print the same (say, the string '3' in one frame and the integer 3 in the other) never compare equal, so the join keys silently fail to match.
Here's a great answer that goes into depth about converting types to numbers, but here's what the code might look like before joining:
# convert column "year" and "month" of new_df
new_df["year"] = pd.to_numeric(new_df["year"])
new_df["month"] = pd.to_numeric(new_df["month"])
And make sure you do the same with df3 as well.
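A full sketch of that fix for both frames (the errors='coerce' flag is my own assumption here; it turns unparseable values into NaN so bad keys surface as misses rather than exceptions):
import pandas as pd

# coerce the join keys to a common numeric dtype in BOTH frames
for frame in (new_df, df3):
    frame['year'] = pd.to_numeric(frame['year'], errors='coerce')
    frame['month'] = pd.to_numeric(frame['month'], errors='coerce')

df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year', 'month'])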
You may also have a data integrity problem: not sure what you're doing before you get those dataframes, but if a column ends up cast as object, you may have had a mix of ints/strings or other data types merged together. Here's a good article that goes over pandas data types. Specifically, an object column can hold a mix of strings and other data, so a join on it can behave unexpectedly.
Hope that helps!

How can I join two dataframes where one column holds two or more values

I have two dataframes similar to this:
A = pd.DataFrame(data={"number": [123, 345]})  # subject_ids column to be filled in
B = pd.DataFrame(data={"number": [123, 123, 345, 345], "subject_id": [222, 333, 444, 555]})
Meaning: Every number has at least two subject ids.
How would you go about merging these dataframes together, so there would be column "subject_ids" in the A dataframe containing joined list of ids in one cell?
"number": [123, 345], "subject_ids": [[222, 333], [444, 555]]
I've tried lots of methods like this:
A.merge(B, how='left', on='number')
But nothing seems to work. (I couldn't find an answer to this either)
The number is a key and those keys are identical, and the second df stores subjects to those numbers. I want the A dataframe to contain a reference to those subject IDs in a list assigned to one row with that given number. One number can have many subjects.
Complaint dataframe, where I want a column holding the list of all subject IDs associated with each number:
number total_complaint_count first_complaint_on last_complaint_on
0 0000000000 77 2021-10-29 2021-12-05
77 00000000000 1 2021-11-12 2021-11-12
78 000000000000 1 2021-11-07 2021-11-07
79 00020056234 1 2021-11-23 2021-11-23
80 0002266648 1 2021-11-02 2021-11-02
Subject dataframe that contains the number to be associated with, subject and subject ID.
number subject \
787 0000000000 Other
4391 0000000000 Calls pretending to be government, businesses,...
694 0000000000 Warranties & protection plans
1106 0000000000 Other
4682 0000000000 Dropped call or no message
subject_id
787 38d1177e-51e8-4cec-aef8-0112f425091b
4391 1964fb22-bd20-4d49-beaf-51322a5f5bad
694 07819535-41b0-44f3-a497-ac2cee16dd1a
1106 2f348025-3f9f-4861-b151-fbb8a1ac14a3
4682 15d33ca0-6d90-42ba-9a1d-74e0dcf28539
Info of both dataframes:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 230122 entries, 0 to 281716
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 number 230122 non-null object
1 total_complaint_count 230122 non-null int64
2 first_complaint_on 19 non-null object
3 last_complaint_on 19 non-null object
dtypes: int64(1), object(3)
memory usage: 8.8+ MB
---------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 281720 entries, 787 to 9377
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 number 281720 non-null object
1 subject 281720 non-null object
2 subject_id 281720 non-null object
dtypes: object(3)
memory usage: 8.6+ MB
Pretty sure I have the answer for you this time.
import numpy as np
import pandas as pd

dfa = pd.DataFrame(columns=['number'],
                   data=np.array([[123],
                                  [345]]))
dfb = pd.DataFrame(columns=['number', 'subject ids'],
                   data=np.array([[123, 222],
                                  [123, 333],
                                  [345, 444],
                                  [345, 555]]))

dfa['ids'] = ''  # create an empty ids column in dfa
for x in dfa.itertuples():
    matches = []  # renamed from `list` to avoid shadowing the built-in
    for a in dfb.itertuples():
        # x[1] is dfa's number, a[1] is dfb's number;
        # when they match, collect dfb's 'subject ids' value
        if x[1] == a[1]:
            matches.append(a[2])
    slist = str(matches)  # change the list to a string
    dfa.loc[x[0], ['ids']] = slist  # write it at the matching dfa index
print(dfa)
I don't know how to quote the table output, but the code above is copy-paste aside from the imports.
I tried to keep the list format to no avail: I tried lambda functions, setting the column with .astype(object), and passing dtype=object to the DataFrame constructor, along with a bunch of other approaches.
If someone else knows how to keep the list as a list and add it to the dataframe using the code above, I would love to know as well.
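For what it's worth, one way to keep actual lists (a sketch against the dfa/dfb frames above): aggregate dfb into per-number lists and map them onto dfa, rather than writing through .loc:
# group dfb's ids into one list per number, then map onto dfa by number
id_lists = dfb.groupby('number')['subject ids'].apply(list)
dfa['ids'] = dfa['number'].map(id_lists)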
I found a solution in this post:
How to implode(reverse of pandas explode) based on a column
I simply grouped by the number column, added the values to the list, and merged the data frames.
Here is the code if somebody needs it:
def create_subject_id_column(complaint_df, subject_df,
                             subject_column="subject", number_column="number"):
    subject_df = subject_df.copy()
    subject_df.drop(subject_column, axis=1, inplace=True)
    subject_df = (subject_df.groupby(number_column)
                  .agg({'subject_id': lambda x: x.tolist()})
                  .reset_index())
    combined_df = complaint_df.merge(subject_df, how="outer", on=number_column)
    return combined_df
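The same idea on the toy frames from the question, as a quick sketch of the result's shape:
import pandas as pd

A = pd.DataFrame({"number": [123, 345]})
B = pd.DataFrame({"number": [123, 123, 345, 345],
                  "subject_id": [222, 333, 444, 555]})

grouped = B.groupby("number").agg({"subject_id": list}).reset_index()
print(A.merge(grouped, how="left", on="number"))
#    number  subject_id
# 0     123  [222, 333]
# 1     345  [444, 555]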

Sorting a Pandas dataframe

I have the following dataframe:
Join_Count 1
LSOA11CD
E01006512 15
E01006513 35
E01006514 11
E01006515 11
E01006518 11
...
But when I try to sort it:
BusStopList.sort("LSOA11CD",ascending=1)
I get the following:
Key Error: 'LSOA11CD'
How do I go about sorting this by either the LSOA column or the column full of numbers which doesn't have a heading?
The following is the information produced by Python about this dataframe:
<class 'pandas.core.frame.DataFrame'>
Index: 286 entries, E01006512 to E01033768
Data columns (total 1 columns):
1 286 non-null int64
dtypes: int64(1)
memory usage: 4.5+ KB
'LSOA11CD' is the name of the index and 1 is the name of the column, which is why sorting on 'LSOA11CD' as if it were a column raises a KeyError. Since you want to sort on the index, use sort_index (rather than sort_values):
BusStopList.sort_index(level="LSOA11CD", ascending=True)
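If you want to sort by the data column instead, note that its name is the integer 1, not the string '1'; a sketch:
# sort by the integer-named column
BusStopList.sort_values(by=1, ascending=True)

# or give it a readable name first ('Join_Count' is borrowed from the printout above)
BusStopList = BusStopList.rename(columns={1: 'Join_Count'})
BusStopList.sort_values('Join_Count')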

Pandas, dataframe with a datetime64 column, querying by hour

I have a pandas dataframe df which has one column of dtype datetime64, e.g.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1471 entries, 0 to 2940
Data columns (total 2 columns):
date 1471 non-null values
id 1471 non-null values
dtypes: datetime64[ns](1), int64(1)
I would like to sub-sample df using the hour of the day as the criterion (independently of the other information in date). E.g., in pseudocode
df_sub = df[ (HOUR(df.date) > 8) & (HOUR(df.date) < 20) ]
for some function HOUR.
I guess the problem can be solved via a preliminary conversion from datetime64 to datetime. Can this be handled more efficiently?
Found a simple solution.
df['hour'] = df.date.apply(lambda x: x.hour)
df_sub = df[(df.hour > 8) & (df.hour < 20)]
EDIT:
There is a property dt specifically introduced to handle this problem. The query becomes:
df_sub = df[(df.date.dt.hour > 8) & (df.date.dt.hour < 20)]
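A tiny self-contained demo of the dt accessor (the timestamps are made up for illustration):
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-01 07:30',
                            '2013-01-01 12:00',
                            '2013-01-01 21:15']),
    'id': [1, 2, 3],
})
df_sub = df[(df.date.dt.hour > 8) & (df.date.dt.hour < 20)]
print(df_sub)  # only the 12:00 row passes the filter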

Is there a way to group by logical comparison of two columns in Pandas?

I have a dataframe with the following structure:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1152 entries, 0 to 143
Data columns:
cuepos 1152 non-null values
response 1152 non-null values
soa 1152 non-null values
targetpos 1152 non-null values
testorientation 1152 non-null values
dtypes: float64(3), int64(2)
The cuepos column and the targetpos column both contain integer values of either 1 or 2.
I would like to group this data by congruency between cuepos and targetpos. In other words, I would like to produce two groups, one for rows in which cuepos == targetpos and another group for which cuepos != targetpos.
I can't seem to figure out how I might do this. I looked at using grouping functions, but these seem only to act on a single column... or am I mistaken? Can someone point me in the right direction?
Thanks in advance!
Blz
Note, if your goal is to do group computations, you can do
df.groupby(df.col1 == df.col2).apply(f)
and the result will be keyed by True/False.
you can also group by multiple columns:
df.groupby(['col1', 'col2']).apply(lambda x: x['col1'] == x['col2'])
you can also use a mask:
df[df.col1==df.col2]
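A small demo of the comparison-based groupby (the values are hypothetical):
import pandas as pd

df = pd.DataFrame({'cuepos':    [1, 1, 2, 2],
                   'targetpos': [1, 2, 2, 1],
                   'testorientation': [0.10, 0.25, 0.30, 0.45]})

# the boolean Series acts as the grouping key: True = congruent rows
print(df.groupby(df.cuepos == df.targetpos)['testorientation'].mean())
# False    0.35  (cuepos != targetpos)
# True     0.20  (cuepos == targetpos)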
