I have a pandas dataframe with output scraped directly from a USDA text file. Below is an example of the dataframe:
Date Region CommodityGroup InboundCity Low High
1/2/2019 Mexico Crossings Beans,Cucumbers,Eggplant,Melons Atlanta 4500 4700
1/2/2019 Eastern North Carolina Apples and Pears Baltimore 7000 8000
1/2/2019 Michigan Apples Boston 3800 4000
I am looking for a programmatic solution to break up the multiple-commodity cells in the "CommodityGroup" column (each commodity is separated by commas or "and" in the above table), create new rows for the separated commodities, and duplicate the rest of the column data for each new row. Desired example output:
Date Region CommodityGroup InboundCity Low High
1/2/2019 Mexico Crossings Beans Atlanta 4500 4700
1/2/2019 Mexico Crossings Cucumbers Atlanta 4500 4700
1/2/2019 Mexico Crossings Eggplant Atlanta 4500 4700
1/2/2019 Mexico Crossings Melons Atlanta 4500 4700
1/2/2019 Eastern North Carolina Apples Baltimore 7000 8000
1/2/2019 Eastern North Carolina Pears Baltimore 7000 8000
1/2/2019 Michigan Apples Boston 3800 4000
Any guidance you can provide in this pursuit will be greatly appreciated!
Use .str.split to split the column with the pattern ',| and ', which matches ',' or ' and ' ('|' means OR).
Use .explode to separate the list elements into separate rows.
Optionally, set ignore_index=True so the resulting index is labeled 0, 1, …, n - 1, depending on your needs.
import pandas as pd
# data
data = {'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
'Low': [4500, 7000, 3800],
'High': [4700, 8000, 4000]}
# create the dataframe
df = pd.DataFrame(data)
# split the CommodityGroup strings
df.CommodityGroup = df.CommodityGroup.str.split(',| and ')
# explode the CommodityGroup lists
df = df.explode('CommodityGroup')
# final
Date Region CommodityGroup InboundCity Low High
0 1/2/2019 Mexico Crossings Beans Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Cucumbers Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Eggplant Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Melons Atlanta 4500 4700
1 1/2/2019 Eastern North Carolina Apples Baltimore 7000 8000
1 1/2/2019 Eastern North Carolina Pears Baltimore 7000 8000
2 1/2/2019 Michigan Apples Boston 3800 4000
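If you prefer a fresh 0, 1, …, n - 1 index instead of the repeated labels above, explode accepts ignore_index=True directly (available since pandas 1.1):
df = df.explode('CommodityGroup', ignore_index=True)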
You can try this:
df = (df.set_index(['Date', 'Region', 'InboundCity', 'Low', 'High'])
        .apply(lambda x: x.str.split(',| and ').explode())
        .reset_index())
print(df)
Date Region InboundCity Low High CommodityGroup
0 1/2/2019 Mexico Crossings Atlanta 4500 4700 Beans
1 1/2/2019 Mexico Crossings Atlanta 4500 4700 Cucumbers
2 1/2/2019 Mexico Crossings Atlanta 4500 4700 Eggplant
3 1/2/2019 Mexico Crossings Atlanta 4500 4700 Melons
4 1/2/2019 Eastern North Carolina Baltimore 7000 8000 Apples
5 1/2/2019 Eastern North Carolina Baltimore 7000 8000 Pears
6 1/2/2019 Michigan Boston 3800 4000 Apples
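A side note, under the assumption that the raw text might contain spaces after the commas (e.g. 'Beans, Cucumbers'): splitting on r',\s*| and ' avoids leading spaces in the exploded values; passing regex=True explicitly requires pandas 1.4+:
df.CommodityGroup = df.CommodityGroup.str.split(r',\s*| and ', regex=True)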
I have a dataset (df) that looks like this:
Date    ID     County Name  State  State Name  Product Name  Type of Transaction   QTY
202105  10001  Los Angeles  CA     California  Shoes         Entry                  630
202012  10002  Houston      TX     Texas       Keyboard      Exit                  5493
202001  11684  Chicago      IL     Illinois    Phone         Disposal               220
202107  12005  New York     NY     New York    Phone         Entry                  302
...     ...    ...          ...    ...         ...           ...                    ...
202111  14990  Orlando      FL     Florida     Shoes         Exit                   201
For every county there are multiple entries for different products, transaction types, and dates, but not all counties have the same number of entries, and they do not cover the same dates.
I want to recreate this dataset such that:
1 - All counties share the same start and end dates; for dates where a county records no entries, the entry is recorded as NaN.
2 - Each product name / transaction type combination becomes its own column.
Essentially, this is how the dataset needs to look:
Date    ID     County Name  State  State Name  Shoes, Entry  Shoes, Exit  Shoes, Disposal  Phones, Entry  Phones, Exit  Phones, Disposal  Keyboard, Entry  Keyboard, Exit  Keyboard, Disposal
202105  10001  Los Angeles  CA     California           594          694             5660          33299          1110              5659             4559            3223               56889
202012  10002  Houston      TX     Texas               3420         4439              549           2110          5669              2245            39294            3345                 556
202001  11684  Chicago      IL     Illinois           55432         4439              329          21190          4320               455            34059           44556                5677
202107  12005  New York     NY     New York           34556         2204             4329          11193         22345             43221             1544            3467               22450
...     ...    ...          ...    ...                  ...          ...              ...            ...           ...               ...              ...             ...                 ...
202111  14990  Orlando      FL     Florida            54543        23059             3290          21394         34335             59660              NaN             NaN                 NaN
In the example above, you can see that Florida does not record certain transactions. I would like to add NaN values so that the dataframe looks like this. I appreciate all the help!
This is essentially a pivot, with flattening of the MultiIndex:
(df
.pivot(index=['Date', 'ID', 'County Name', 'State', 'State Name'],
columns=['Product Name', 'Type of Transaction'],
values='QTY')
 .pipe(lambda d: d.set_axis(map(','.join, d.columns), axis=1))
.reset_index()
)
Output:
Date ID County Name State State Name Shoes,Entry Keyboard,Exit \
0 202001 11684 Chicago IL Illinois NaN NaN
1 202012 10002 Houston TX Texas NaN 5493.0
2 202105 10001 Los Angeles CA California 630.0 NaN
3 202107 12005 New York NY New York NaN NaN
Phone,Disposal Phone,Entry
0 220.0 NaN
1 NaN NaN
2 NaN NaN
3 NaN 302.0
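The pivot covers requirement 2. For requirement 1 (every county spanning the same set of dates), one possible sketch, assuming every date that appears anywhere in the data should exist for every county, is to build the full county x date cross product before pivoting (merge(how='cross') requires pandas 1.2+):
# all distinct counties and all distinct dates
counties = df[['ID', 'County Name', 'State', 'State Name']].drop_duplicates()
dates = df[['Date']].drop_duplicates()
# every (county, date) pair, then left-join the observations back in
full = counties.merge(dates, how='cross')
df_full = full.merge(df, on=['Date', 'ID', 'County Name', 'State', 'State Name'], how='left')
# pivot df_full as above; pairs with no data carry NaN in Product Name/QTY,
# so after the pivot they show up as all-NaN rows (drop the stray NaN column)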
I need to match the identical fields of two columns from two separate dataframes and rewrite the original dataframe based on the other one.
So I have this original df:
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Vienna
3 Toyota Zurich
4 Renault Sydney
5 Ford Toronto
6 BMW Hamburg
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat San Francisco
11 Audi New York City
12 Ferrari Oslo
13 Volkswagen Stockholm
14 Lamborghini Singapore
15 Mercedes Lisbon
16 Jaguar Boston
And this new df:
Car Brand Current City
0 Tesla Amsterdam
1 Renault Paris
2 BMW Munich
3 Fiat Detroit
4 Audi Berlin
5 Ferrari Bruxelles
6 Lamborghini Rome
7 Mercedes Madrid
I need to match the car brands that are identical across the two dataframes above and write the new associated city into the original df, so the result should be this one (for example, Tesla is now Amsterdam instead of Vienna):
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Amsterdam
3 Toyota Zurich
4 Renault Paris
5 Ford Toronto
6 BMW Munich
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat Detroit
11 Audi Berlin
12 Ferrari Bruxelles
13 Volkswagen Stockholm
14 Lamborghini Rome
15 Mercedes Madrid
16 Jaguar Boston
I tried this code to map the columns and rewrite the field, but it doesn't really work and I cannot figure out why:
original_df['Original City'] = original_df['Car Brand'].map(dict(corrected_df[['Car Brand', 'Current City']]))
How can I make it work? Thanks a lot!
P.S.: Code for df:
cars = ['Daimler', 'Mitsubishi','Tesla', 'Toyota', 'Renault', 'Ford','BMW', 'Audi Sport','Citroen', 'Chevrolet', 'Fiat', 'Audi', 'Ferrari', 'Volkswagen','Lamborghini', 'Mercedes', 'Jaguar']
cities = ['Chicago', 'LA', 'Vienna', 'Zurich', 'Sydney', 'Toronto', 'Hamburg', 'Helsinki', 'Dublin', 'Brisbane', 'San Francisco', 'New York City', 'Oslo', 'Stockholm', 'Singapore', 'Lisbon', 'Boston']
data = {'Original Car Brand': cars, 'Original City': cities}
original_df = pd.DataFrame(data, columns=['Original Car Brand', 'Original City'])
---
cars = ['Tesla', 'Renault', 'BMW', 'Fiat', 'Audi', 'Ferrari', 'Lamborghini', 'Mercedes']
cities = ['Amsterdam', 'Paris', 'Munich', 'Detroit', 'Berlin', 'Bruxelles', 'Rome', 'Madrid']
data = {'Car Brand': cars, 'Current City': cities}
corrected_df = pd.DataFrame(data, columns=['Car Brand', 'Current City'])
Use Series.map, replacing the values that did not match with the original column via Series.fillna:
s = corrected_df.set_index('Car Brand')['Current City']
original_df['Original City'] = (original_df['Original Car Brand'].map(s)
.fillna(original_df['Original City']))
print(original_df)
Original Car Brand Original City
0 Daimler Chicago
1 Mitsubishi LA
2 Tesla Amsterdam
3 Toyota Zurich
4 Renault Paris
5 Ford Toronto
6 BMW Munich
7 Audi Sport Helsinki
8 Citroen Dublin
9 Chevrolet Brisbane
10 Fiat Detroit
11 Audi Berlin
12 Ferrari Bruxelles
13 Volkswagen Stockholm
14 Lamborghini Rome
15 Mercedes Madrid
16 Jaguar Boston
Your solution can be fixed by converting both columns to a NumPy array before building the dict:
d = dict(corrected_df[['Car Brand','Current City']].to_numpy())
original_df['Original City'] = (original_df['Original Car Brand'].map(d)
.fillna(original_df['Original City']))
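An equivalent, and arguably clearer, way to build the same mapping is dict(zip(...)) over the two columns:
d = dict(zip(corrected_df['Car Brand'], corrected_df['Current City']))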
You can use the set_index() and assign() methods (selecting the 'Current City' column so a Series, not a DataFrame, is assigned):
resultdf = original_df.set_index('Original Car Brand').assign(OriginalCity=corrected_df.set_index('Car Brand')['Current City'])
Finally, use the fillna() and reset_index() methods:
resultdf = resultdf['OriginalCity'].fillna(resultdf['Original City']).reset_index()
Let us try update, which aligns on the index and writes the matching values into df1 in place (here df1 is original_df and df2 is corrected_df):
df1 = df1.set_index('Original Car Brand')
df1.update(df2.set_index('Car Brand').rename(columns={'Current City': 'Original City'}))  # update only writes to matching column labels
df1 = df1.reset_index()
Merge can do the work as well; with how='left' the merged frame keeps original_df's row order and index, so the result aligns:
original_df['Original City'] = (original_df
    .merge(corrected_df, left_on='Original Car Brand', right_on='Car Brand', how='left')['Current City']
    .fillna(original_df['Original City']))
I have one dataframe, shown below:
0
____________________________________
0 Country| India
60 Delhi
62 Mumbai
68 Chennai
75 Country| Italy
78 Rome
80 Venice
85 Milan
88 Country| Australia
100 Sydney
103 Melbourne
107 Perth
I want to split the data into 2 columns so that one column holds the country and the other the city. I have no idea where to start. I want it like below:
0 1
____________________________________
0 Country| India Delhi
1 Country| India Mumbai
2 Country| India Chennai
3 Country| Italy Rome
4 Country| Italy Venice
5 Country| Italy Milan
6 Country| Australia Sydney
7 Country| Australia Melbourne
8 Country| Australia Perth
Any idea how to do this?
Look for rows where | is present, pull them into another column, and fill down on the newly created column:
(
df.rename(columns={"0": "city"})
# this looks for rows that contain '|' and puts them into a
# new column called Country. rows that do not match will be
# null in the new column.
.assign(Country=lambda x: x.loc[x.city.str.contains(r"\|"), "city"])
# fill down on the Country column, this also has the benefit
# of linking the Country with the City,
.ffill()
# here we get rid of duplicate Country entries in city and Country
# this ensures that only Country entries are in the Country column
# and cities are in the City column
.query("city != Country")
# here we reverse the column positions to match your expected output
.iloc[:, ::-1]
)
Country city
60 Country| India Delhi
62 Country| India Mumbai
68 Country| India Chennai
78 Country| Italy Rome
80 Country| Italy Venice
85 Country| Italy Milan
100 Country| Australia Sydney
103 Country| Australia Melbourne
107 Country| Australia Perth
Use DataFrame.insert with Series.where and Series.str.startswith to replace the non-matching values with missing values, forward fill those with ffill, and then remove the rows where both columns hold the same value, using Series.ne (not equal) in boolean indexing:
df.insert(0, 'country', df[0].where(df[0].str.startswith('Country')).ffill())
df = df[df['country'].ne(df[0])].reset_index(drop=True).rename(columns={0:'city'})
print (df)
country city
0 Country|India Delhi
1 Country|India Mumbai
2 Country|India Chennai
3 Country|Italy Rome
4 Country|Italy Venice
5 Country|Italy Milan
6 Country|Australia Sydney
7 Country|Australia Melbourne
8 Country|Australia Perth
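If you also want to drop the 'Country| ' prefix afterwards, a small follow-up step could be the line below (using the lowercase country column from the second answer; adjust the name to Country for the first):
df['country'] = df['country'].str.replace(r'^Country\|\s*', '', regex=True)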
I have a table like this:
Province Country Date infected
New South Wales Australia 1/22/20 12
Victoria Australia 1/22/20 10
British Columbia Canada 1/22/20 5
USA 1/22/20 7
New South Wales Australia 1/23/20 6
Victoria Australia 1/23/20 2
British Columbia Canada 1/23/20 1
USA 1/23/20 10
Now I want to convert that table into this:
Province Country Date infected
New South Wales Australia 1/22/20 12
1/23/20 6
Victoria Australia 1/22/20 10
1/23/20 2
British Columbia Canada 1/22/20 5
1/23/20 1
USA 1/22/20 7
1/23/20 10
I have tried df.sort_values('Date'), but no luck.
How can I produce such a table?
I'm a Python rookie, but let me think along (I'm sure this can be done more neatly).
df = df.fillna(method='ffill')  # newer pandas spells this df.ffill()
df = df.groupby(['Province', 'Country', 'Date']).sum()
This gave me:
Province Country Date infected
British Columbia Canada 1/22/20 5
1/23/20 1
USA 1/22/20 7
1/23/20 10
New South Wales Australia 1/22/20 12
1/23/20 6
Victoria Australia 1/22/20 10
1/23/20 2
I kind of anticipated you would have NaN values in the empty places (at least that is what I got importing the dataframe). I changed all these NaN values to the value in the row above them.
Then a groupby gave me the result above. Not sure if this is what you were after, but maybe it sparked some ideas =)
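A more compact version of the same idea, assuming a recent pandas where df.ffill() is the equivalent of fillna(method='ffill'):
df = df.ffill().groupby(['Province', 'Country', 'Date']).sum()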
dict = {"Province": ["New South Wales", "Victoria", "British Columbia", "", "New South Wales", "Victoria", "British Columbia", ""],
"Country": ["Australia", "Australia", "Canada", "USA", "Australia", "Australia", "Canada", "USA"],
"Date": ["1/22/20", "1/22/20", "1/22/20", "1/22/20", "1/23/20", "1/23/20", "1/23/20", "1/23/20"],
"infected": [12, 10, 6, 5, 2, 3, 4, 5] }
import pandas as pd
brics = pd.DataFrame(dict)
print(brics)
df = brics.set_index(['Country', 'Province', 'Date']).sort_values(['Country', 'Province', 'Date'])
print(df)
Output:
Province Country Date infected
0 New South Wales Australia 1/22/20 12
1 Victoria Australia 1/22/20 10
2 British Columbia Canada 1/22/20 6
3 USA 1/22/20 5
4 New South Wales Australia 1/23/20 2
5 Victoria Australia 1/23/20 3
6 British Columbia Canada 1/23/20 4
7 USA 1/23/20 5
infected
Country Province Date
Australia New South Wales 1/22/20 12
1/23/20 2
Victoria 1/22/20 10
1/23/20 3
Canada British Columbia 1/22/20 6
1/23/20 4
USA 1/22/20 5
1/23/20 5
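Since all three sort keys are index levels after set_index, sort_index() gives the same ordering in one step:
df = brics.set_index(['Country', 'Province', 'Date']).sort_index()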
I am attempting to parse a text file using Python and regex to construct a specific pandas DataFrame. Below is a sample from the text file I am parsing and the ideal pandas DataFrame I am seeking.
Sample Text
Washington, DC November 27, 2019
USDA Truck Rate Report
WA_FV190
FIRST PRICE RANGE FOR WEEK OF NOVEMBER 20-26 2019
SECOND PRICE MOSTLY FOR TUESDAY NOVEMBER 26 2019
PERCENTAGE OF CHANGE FROM TUESDAY NOVEMBER 19 2019 SHOWN IN ().
In areas where rates are based on package rates, per-load rates were
derived by multiplying the package rate by the number of packages in
the most usual load in a 48-53 foot trailer.
CENTRAL AND WESTERN ARIZONA
-- LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LEAF LETTUCE SLIGHT SHORTAGE
--
ATLANTA 5100 5500
BALTIMORE 6300 6600
BOSTON 7000 7300
CHICAGO 4500 4900
DALLAS 3400 3800
MIAMI 6400 6700
NEW YORK 6600 6900
PHILADELPHIA 6400 6700
2019 2018
NOV 17-23 NOV 18-24
U.S. 25,701 22,956
IMPORTS 13,653 15,699
------------ --------------
sum 39,354 38,655
The ideal output should look something like:
Region CommodityGroup InboundCity Low High
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC ATLANTA 5100 5500
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BALTIMORE 6300 6600
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BOSTON 7000 7300
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC CHICAGO 4500 4900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC DALLAS 3400 3800
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC MIAMI 6400 6700
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC NEW YORK 6600 6900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC PHILADELPHIA 6400 6700
With my limited understanding of regex, this is the closest I have come to successfully isolating the desired text: regex tester for USDA data
I have been trying to adapt the solution from How to parse complex text files using Python? where applicable, but my regex experience is severely lacking. Any help you can provide will be greatly appreciated!
I came up with this regex (txt is your text from the question):
import re
import numpy as np
import pandas as pd
data = {'Region':[], 'CommodityGroup':[], 'InboundCity':[], 'Low':[], 'High':[]}
for region, commodity_group, values in re.findall(r'([A-Z ]+)\n--(.*?)--\n(.*?)\n\n', txt, flags=re.S|re.M):
for val in values.strip().splitlines():
val = re.sub(r'(\d)\s{8,}.*', r'\1', val)
inbound_city, low, high = re.findall(r'([A-Z ]+)\s*(\d*)\s+(\d+)', val)[0]
data['Region'].append(region)
data['CommodityGroup'].append(commodity_group)
data['InboundCity'].append(inbound_city)
data['Low'].append(np.nan if low == '' else int(low))
data['High'].append(int(high))
df = pd.DataFrame(data)
print(df)
Prints:
Region CommodityGroup InboundCity Low High
0 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... ATLANTA 5100 5500
1 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... BALTIMORE 6300 6600
2 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... BOSTON 7000 7300
3 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... CHICAGO 4500 4900
4 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... DALLAS 3400 3800
5 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... MIAMI 6400 6700
6 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... NEW YORK 6600 6900
7 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... PHILADELPHIA 6400 6700
EDIT: This should now work even for your big document from regex101.
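For anyone newer to regex, the same pattern can be written with re.VERBOSE to document each piece; this is purely a readability rewrite, with no behavior change:
pattern = re.compile(r"""
    ([A-Z ]+)\n    # region: a line of capital letters and spaces
    --(.*?)--\n    # commodity group: the text between the -- markers
    (.*?)\n\n      # the block of city/price lines, up to a blank line
""", flags=re.S | re.M | re.X)
# pattern.findall(txt) returns the same (region, commodity_group, values) tuples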