How to transform combinations of values in columns into individual columns? - python

I have a dataset (df) that looks like this:
Date    ID     County Name  State  State Name  Product Name  Type of Transaction  QTY
202105  10001  Los Angeles  CA     California  Shoes         Entry                630
202012  10002  Houston      TX     Texas       Keyboard      Exit                 5493
202001  11684  Chicago      IL     Illionis    Phone         Disposal             220
202107  12005  New York     NY     New York    Phone         Entry                302
...     ...    ...          ...    ...         ...           ...                  ...
202111  14990  Orlando      FL     Florida     Shoes         Exit                 201
For every county there are multiple entries for different products, transaction types, and dates, but not all counties have the same number of entries, and they do not cover the same dates.
I want to reshape this dataset such that:
1 - All counties share the same start and end dates, and for dates where a county records no entries, the value is recorded as NaN.
2 - The product names and their transaction types become their own columns.
Essentially, this is how the dataset needs to look:
Date    ID     County Name  State  State Name  Shoes, Entry  Shoes, Exit  Shoes, Disposal  Phones, Entry  Phones, Exit  Phones, Disposal  Keyboard, Entry  Keyboard, Exit  Keyboard, Disposal
202105  10001  Los Angeles  CA     California  594           694          5660             33299          1110          5659              4559             3223            56889
202012  10002  Houston      TX     Texas       3420          4439         549              2110           5669          2245              39294            3345            556
202001  11684  Chicago      IL     Illionis    55432         4439         329              21190          4320          455               34059            44556           5677
202107  12005  New York     NY     New York    34556         2204         4329             11193          22345         43221             1544             3467            22450
...     ...    ...          ...    ...         ...           ...          ...              ...            ...           ...               ...              ...             ...
202111  14990  Orlando      FL     Florida     54543         23059        3290             21394          34335         59660             NaN              NaN             NaN
In the example above you can see that Florida does not record certain transactions. I would like to add NaN values so that the dataframe looks like this. I appreciate all the help!

This is essentially a pivot, with flattening of the MultiIndex:
(df
 .pivot(index=['Date', 'ID', 'County Name', 'State', 'State Name'],
        columns=['Product Name', 'Type of Transaction'],
        values='QTY')
 .pipe(lambda d: d.set_axis(map(','.join, d.columns), axis=1))
 .reset_index()
)
Output:
Date ID County Name State State Name Shoes,Entry Keyboard,Exit \
0 202001 11684 Chicago IL Illionis NaN NaN
1 202012 10002 Houston TX Texas NaN 5493.0
2 202105 10001 Los Angeles CA California 630.0 NaN
3 202107 12005 New York NY New York NaN NaN
Phone,Disposal Phone,Entry
0 220.0 NaN
1 NaN NaN
2 NaN NaN
3 NaN 302.0
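The pivot handles requirement 2 (product/transaction combinations as columns). For requirement 1 (every county covering the same start and end dates), one option, sketched here under the assumption of pandas >= 1.2 (for how='cross'), is to build the full Date x county grid and left-merge the pivoted frame onto it, so missing combinations show up as NaN rows:
import pandas as pd

wide = (df
        .pivot(index=['Date', 'ID', 'County Name', 'State', 'State Name'],
               columns=['Product Name', 'Type of Transaction'],
               values='QTY')
        .pipe(lambda d: d.set_axis(map(','.join, d.columns), axis=1))
        .reset_index())

# every county paired with every date that occurs anywhere in the data
counties = wide[['ID', 'County Name', 'State', 'State Name']].drop_duplicates()
dates = pd.DataFrame({'Date': sorted(wide['Date'].unique())})
full_grid = dates.merge(counties, how='cross')

# left-merge: county/date combinations with no transactions become NaN rows
wide_full = full_grid.merge(wide, how='left',
                            on=['Date', 'ID', 'County Name', 'State', 'State Name'])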

Related

Swap df1 column with df2 column, based on value

Goal: swap out df_hsa.stateabbr with df_state.state, based on df_state.abbr.
Is there such a function, where I mention source, destination, and based-on dataframe columns?
Do I need to order both DataFrames similarly?
df_hsa:
hsa stateabbr county
0 259 AL Butler
1 177 AL Calhoun
2 177 AL Cleburne
3 172 AL Chambers
4 172 AL Randolph
df_state:
abbr state
0 AL Alabama
1 AK Alaska
2 AZ Arizona
3 AR Arkansas
4 CA California
Desired Output:
df_hsa with state column instead of stateabbr.
hsa state county
0 259 Alabama Butler
1 177 Alabama Calhoun
2 177 Alabama Cleburne
3 172 Alabama Chambers
4 172 Alabama Randolph
You can simply join after setting the index to "stateabbr":
df_hsa.set_index("stateabbr").join(df_state.set_index("abbr"))
output:
hsa county state
AL 259 Butler Alabama
AL 177 Calhoun Alabama
AL 177 Cleburne Alabama
AL 172 Chambers Alabama
AL 172 Randolph Alabama
If you also want to keep the original index, you can add .set_index(df_hsa.index) at the end of the line.
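Putting the two pieces together, a small sketch (using the frames shown above) that also restores the original row index and the requested column order:
result = (df_hsa.set_index("stateabbr")
                .join(df_state.set_index("abbr"))
                .set_index(df_hsa.index)       # keep the original 0..4 row index
                [["hsa", "state", "county"]])  # requested column order
print(result)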

Pandas Dataframe replace string value based on condition AND using original value

I have a dataframe that looks like this:
YEAR MONTH DAY_OF_MONTH DAY_OF_WEEK ORIGIN_CITY_NAME ORIGIN_STATE_ABR DEST_CITY_NAME DEST_STATE_ABR DEP_TIME DEP_DELAY_NEW ARR_TIME ARR_DELAY_NEW CANCELLED AIR_TIME
0 2020 1 1 3 Ontario CA San Francisco CA 1851 41 2053 68 0 74
1 2020 1 1 3 Ontario CA San Francisco CA 1146 0 1318 0 0 71
2 2020 1 1 3 Ontario CA San Jose CA 2016 0 2124 0 0 57
3 2020 1 1 3 Ontario CA San Jose CA 1350 10 1505 10 0 63
4 2020 1 1 3 Ontario CA San Jose CA 916 1 1023 0 0 57
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
607341 2020 1 16 4 Portland ME New York NY 554 0 846 65 0 57
607342 2020 1 17 5 Portland ME New York NY 633 33 804 23 0 69
607343 2020 1 18 6 Portland ME New York NY 657 0 810 0 0 55
607344 2020 1 19 7 Portland ME New York NY 705 5 921 39 0 54
607345 2020 1 20 1 Portland ME New York NY 628 0 741 0 0 52
I am trying to modify the DEP_TIME and ARR_TIME columns so that they have the format hh:mm. All values should be treated as strings. There are also null values in some rows that need to be accounted for. Performance is also a consideration (albeit secondary to solving the actual problem), since I need to change about 10M records in total.
The challenge for me is figuring out how to modify these values based on a condition while also having access to the original value when replacing it. I simply could not find a solution for that specific problem elsewhere; most examples replace values with a known constant.
Thanks for your help.
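One possible approach, not from the original post, just a sketch assuming DEP_TIME and ARR_TIME hold hhmm-style numbers (e.g. 1851, 916) with NaN for missing rows: zero-pad the values as strings and slice, which stays vectorized and leaves nulls untouched.
import pandas as pd

def to_hhmm(col: pd.Series) -> pd.Series:
    # operate only on non-null values; dropped NaN rows come back as NaN
    # through the final reindex
    s = col.dropna().astype(int).astype(str).str.zfill(4)
    return (s.str[:2] + ':' + s.str[2:]).reindex(col.index)

df[['DEP_TIME', 'ARR_TIME']] = df[['DEP_TIME', 'ARR_TIME']].apply(to_hhmm)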

Cumsum with groupby

I have a dataframe containing:
State Country Date Cases
0 NaN Afghanistan 2020-01-22 0
271 NaN Afghanistan 2020-01-23 0
... ... ... ... ...
85093 NaN Zimbabwe 2020-11-30 9950
85364 NaN Zimbabwe 2020-12-01 10129
I'm trying to create a new column of cumulative cases but grouped by Country AND State.
State Country Date Cases Total Cases
231 California USA 2020-01-22 5 5
342 California USA 2020-01-23 10 15
233 Texas USA 2020-01-22 4 4
322 Texas USA 2020-01-23 12 16
I have been trying to follow Pandas groupby cumulative sum and have tried things such as:
df['Total'] = df.groupby(['State','Country'])['Cases'].cumsum()
Returns a series of -1's
df['Total'] = df.groupby(['State', 'Country']).sum() \
.groupby(level=0).cumsum().reset_index()
Returns the sum.
df['Total'] = df.groupby(['Country'])['Cases'].apply(lambda x: x.cumsum())
Doesn't separate sums by state.
df_f['Total'] = df_f.groupby(['Region','State'])['Cases'].apply(lambda x: x.cumsum())
This one works, except that when 'State' is NaN, 'Total' is also NaN.
arrays = [['California', 'California', 'Texas', 'Texas'],
          ['USA', 'USA', 'USA', 'USA'],
          ['2020-01-22', '2020-01-23', '2020-01-22', '2020-01-23'],
          [5, 10, 4, 12]]
df = pd.DataFrame(list(zip(*arrays)), columns=['State', 'Country', 'Date', 'Cases'])
df
State Country Date Cases
0 California USA 2020-01-22 5
1 California USA 2020-01-23 10
2 Texas USA 2020-01-22 4
3 Texas USA 2020-01-23 12
temp = df.set_index(['State', 'Country', 'Date'], drop=True).sort_index()
df['Total Cases'] = temp.groupby(['State', 'Country']).cumsum().reset_index()['Cases']
df
State Country Date Cases Total Cases
0 California USA 2020-01-22 5 5
1 California USA 2020-01-23 10 15
2 Texas USA 2020-01-22 4 4
3 Texas USA 2020-01-23 12 16
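For the NaN-State issue mentioned in the question, a possible tweak (assuming pandas >= 1.1, which added the dropna argument to groupby) is to keep missing states as their own group so those rows still get a running total:
df['Total Cases'] = (df.sort_values('Date')                     # cumulative sum in date order
                       .groupby(['State', 'Country'], dropna=False)['Cases']
                       .cumsum())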

Pandas dataframe Split One column data into 2 using some condition

I have one dataframe which is below-
0
____________________________________
0 Country| India
60 Delhi
62 Mumbai
68 Chennai
75 Country| Italy
78 Rome
80 Venice
85 Milan
88 Country| Australia
100 Sydney
103 Melbourne
107 Perth
I want to split the data into 2 columns so that one column contains the country and the other the city. I have no idea where to start. I want it like below:
0 1
____________________________________
0 Country| India Delhi
1 Country| India Mumbai
2 Country| India Chennai
3 Country| Italy Rome
4 Country| Italy Venice
5 Country| Italy Milan
6 Country| Australia Sydney
7 Country| Australia Melbourne
8 Country| Australia Perth
Any idea how to do this?
Look for rows where | is present, pull them into another column, and fill down on the newly created column:
(
df.rename(columns={"0": "city"})
# this looks for rows that contain '|' and puts them into a
# new column called Country. rows that do not match will be
# null in the new column.
.assign(Country=lambda x: x.loc[x.city.str.contains(r"\|"), "city"])
# fill down on the Country column, this also has the benefit
# of linking the Country with the City,
.ffill()
# here we get rid of duplicate Country entries in city and Country
# this ensures that only Country entries are in the Country column
# and cities are in the City column
.query("city != Country")
# here we reverse the column positions to match your expected output
.iloc[:, ::-1]
)
Country city
60 Country| India Delhi
62 Country| India Mumbai
68 Country| India Chennai
78 Country| Italy Rome
80 Country| Italy Venice
85 Country| Italy Milan
100 Country| Australia Sydney
103 Country| Australia Melbourne
107 Country| Australia Perth
Use DataFrame.insert with Series.where and Series.str.startswith to replace non-matching values with missing values, forward-fill them with ffill, and then remove rows where both columns hold the same value using Series.ne (not equal) in boolean indexing:
df.insert(0, 'country', df[0].where(df[0].str.startswith('Country')).ffill())
df = df[df['country'].ne(df[0])].reset_index(drop=True).rename(columns={0:'city'})
print (df)
country city
0 Country|India Delhi
1 Country|India Mumbai
2 Country|India Chennai
3 Country|Italy Rome
4 Country|Italy Venice
5 Country|Italy Milan
6 Country|Australia Sydney
7 Country|Australia Melbourne
8 Country|Australia Perth

pandas join gives NaN values

I want to join 2 DataFrames
Zipcode Database (first 10 entries)
0 zip_code City State County Population
0 0 90001 Los Angeles California Los Angeles 54481
1 1 90002 Los Angeles California Los Angeles 44584
2 2 90003 Los Angeles California Los Angeles 58187
3 3 90004 Los Angeles California Los Angeles 67850
4 4 90005 Los Angeles California Los Angeles 43014
5 5 90006 Los Angeles California Los Angeles 62765
6 6 90007 Los Angeles California Los Angeles 45021
7 7 90008 Los Angeles California Los Angeles 30840
8 8 90009 Los Angeles California Los Angeles -
9 9 90010 Los Angeles California Los Angeles 1943
And data (first 10 entries)
buyer zip_code
0 SWEENEY,THOMAS R & MICHELLE H NaN
1 DOUGHERTY,HERBERT III & JENNIFER M NaN
2 WEST COAST RLTY SVCS INC NaN
3 LOVE,JULIE M NaN
4 SAHAR,DAVID NaN
5 SILBERSTERN,BRADLEY E TRUST 91199
6 LEE,SUSAN & JIMMY C 92025
7 FRAZZANO REAL ESTATE I NC NaN
8 RUV INVESTMENTS LLC 91730
9 KAOS KAPITAL LLC NaN
So the final table should have [buyer, zip_code, City, County]. I'm joining with respect to Zip code.
data_2 = data.join(zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'), on='zip_code')
But the city and county columns are NaN even for the tuples in data where zipcode is actually present.
buyer zip_code City County
10 LANDON AVE TRUST 37736 NaN NaN NaN
11 UMAR,AHMAD NaN NaN NaN
12 3 JPS INC 90717 NaN NaN
13 T & L HOLDINGS INC 95610 NaN NaN
14 CAHP HOLDINGS LLC 90808 NaN NaN
15 REBUILDING TOGETHER LONG BEACH 92344 NaN NaN
16 COLFIN AI-CA 4 LLC NaN NaN NaN
17 GUTIERREZ,HUGO 91381 NaN NaN
18 VALBRIDGE CAP GOLDEN GATE FUND NaN NaN NaN
19 SOLARES,OSCAR 92570 NaN NaN
Why is this the case? The zipcode database has all zipcodes from 90001 - 999950.
My first thought is that the datatype of "zip_code" differs between the two:
print(zipcode_database['zip_code'].dtype)
print(data['zip_code'].dtype)
Output:
int64
object
I thought of typecasting with astype, but that does not work with NaN values. Any thoughts?
You can cast NaN values to float types, but not int. In your case I would cast the zip_code field in both DataFrames to a float and then join.
zipcode_database.zip_code = zipcode_database.zip_code.astype(float)
data.zip_code = data.zip_code.astype(float)
data_2 = data.join(zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'), on='zip_code')
I can't reproduce anything meaningful from your example data (no matching zip codes), but that should fix the issue.
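As an alternative to floats, a sketch assuming a reasonably recent pandas (the nullable integer dtype was introduced in 0.24): the zip codes can stay integers while still allowing missing values.
# parse the object column, keeping NaN as <NA> instead of converting to float display like 91730.0
data['zip_code'] = pd.to_numeric(data['zip_code'], errors='coerce').astype('Int64')
zipcode_database['zip_code'] = zipcode_database['zip_code'].astype('Int64')

data_2 = data.join(
    zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'),
    on='zip_code')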
