Trim each column's values in pandas - Python

I am working with .xls files. After importing the data into a DataFrame with pandas, I need to trim the values. I have a lot of columns, and within a column each value starts with xxx: or yyy:,
for example:
xxx:abc yyy:def
xxx:def yyy:ghi
xxx:ghi yyy:jkl
...
I need to trim the xxx: and yyy: prefixes from each column. I researched and tried some solutions from similar questions, but they didn't work. How can I trim these prefixes efficiently? Thanks in advance.
(The unnecessary characters don't have a fixed length; I just know what they look like, similar to stop words. For example:
['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB', ...]
['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB', ...]
I want the new dataset to look like:
['Apple', 'iPhone', '2018', '128GB', ...]
['Samsung', 'Note', '2017', '64GB', ...]
So I want to trim the ('Comp:', 'Product:', 'Year:', ...) stop words from each column.)

You can use pd.Series.str.split for this:

import pandas as pd

df = pd.DataFrame([['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB'],
                   ['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB']],
                  columns=['Comp', 'Product', 'Year', 'Memory'])

for col in ['Comp', 'Product', 'Year']:
    df[col] = df[col].str.split(':').str.get(1)

#       Comp Product  Year Memory
# 0    Apple  iPhone  2018  128GB
# 1  Samsung    Note  2017   64GB
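If you would rather not list the prefixed columns by hand, a variant (my own sketch, not part of the answer above) is DataFrame.replace with a regex that strips any leading "label:" prefix, assuming the label itself never contains a colon:

```python
import pandas as pd

df = pd.DataFrame([['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB'],
                   ['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB']],
                  columns=['Comp', 'Product', 'Year', 'Memory'])

# strip a leading "label:" prefix from every string cell in one pass;
# cells without a colon (e.g. '128GB') are left untouched
df = df.replace(r'^[^:]+:', '', regex=True)
```

Unlike str.split(':').str.get(1), this leaves values without a colon unchanged instead of turning them into NaN, so it can be applied to the whole frame at once.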

Insert rows in Python dataframe with conditions

I have a large data file as shown below.
Edited to include an updated example:
I wanted to add two new columns (E and F) next to column D and move the suite number (when applicable) and the City/State data in cells D3 and D4 to E2 and F2, respectively. The challenge is that not every entry has a suite number. For the entries that don't have the suite number, and only those, I would need to insert a row first.
I know how to write loops, but I am having trouble defining the conditions. One way is to count the length of the string. How should I get started? I much appreciate your help!
This is how I would do it. I don't recommend looping when using pandas; it has so many tools that looping is often not needed. Some caution on this: your spreadsheet has NaN, which I think is equivalent to numpy's np.nan. You also have blanks, which I am assuming are equivalent to ''.
import pandas as pd
import numpy as np

# dictionary of your data
companies = {
    'Comp ID': ['C1', '', np.nan, 'C2', '', np.nan, 'C3', np.nan],
    'Address': ['10 foo', 'Suite A', 'foo city', '11 spam', 'STE 100', 'spam town', '12 ham', 'Myhammy'],
    'phone': ['888-321-4567', '', np.nan, '888-321-4567', '', np.nan, '888-321-4567', np.nan],
    'Type': ['W_sale', '', np.nan, 'W_sale', '', np.nan, 'W_sale', np.nan],
}

# make the frames needed
df = pd.DataFrame(companies)
df1 = pd.DataFrame()  # blank frame for the suite and town columns

# Edit here to TEST the data types
for r in range(0, 5):
    v = df['Comp ID'].values[r]
    print(f'this "{v}" is a ', type(v))
# This tells us the data types so we can construct our where(). Back to the prior answer...

# We need a where clause; it is similar to an IF() statement in Excel
df1['Suite'] = np.where(df['Comp ID'] == '', df['Address'], np.nan)
df1['City/State'] = np.where(df['Comp ID'].isna(), df['Address'], np.nan)
# copy values to the rows above
df1 = df1[['Suite', 'City/State']].backfill()
# join the frames together on the index
df = df.join(df1)
df.drop_duplicates(subset=['City/State'], keep='first', inplace=True)
# set the column order to what you want
df = df[['Comp ID', 'Type', 'Address', 'Suite', 'City/State', 'phone']]
output:

  Comp ID    Type  Address    Suite City/State         phone
0      C1  W_sale   10 foo  Suite A   foo city  888-321-4567
3      C2  W_sale  11 spam  STE 100  spam town  888-321-4567
6      C3  W_sale   12 ham      NaN    Myhammy  888-321-4567
Edit: the numpy where statement:
numpy is brought in by the line import numpy as np at the top. We are creating a calculated column based on the 'Comp ID' column; numpy does this without loops. Think of where like an Excel IF() function:
df1['new column'] = np.where(condition, value_if_true, value_if_false)
The pandas backfill:
Sometimes a value sits in the cell below a blank cell and you want to duplicate it into the blank cell above it. So you backfill: df1 = df1[['Suite', 'City/State']].backfill().
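As a minimal illustration of the two building blocks above, here is a sketch with made-up three-row data. Note the true branch of np.where is an object-dtype Series, so the NaN in the false branch survives (with a plain string literal, numpy would coerce NaN to the string 'nan'). bfill is the modern spelling of backfill:

```python
import pandas as pd
import numpy as np

ids = pd.Series(['C1', '', np.nan], dtype=object)
addr = pd.Series(['10 foo', 'Suite A', 'foo city'], dtype=object)

# like Excel IF(): where the ID is blank, take the address, otherwise NaN
suite = pd.Series(np.where(ids == '', addr, np.nan))

# bfill (a.k.a. backfill) copies each value into the NaN cells above it
suite = suite.bfill()
print(suite.tolist())
# → ['Suite A', 'Suite A', nan]
```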

How to remove duplicates based on partial match

I don't even know how to approach this, as it feels too complex for my level.
Imagine courier tracking numbers: I am receiving some duplicated updates from an upstream system in the following format.
See the attached image, or this small piece of code that creates such a table:
import pandas as pd

incoming_df = pd.DataFrame({
    'Tracking ID': ['4845', '24345', '8436474', '457453', '24345-S2'],
    'Previous': ['Paris', 'Lille', 'Paris', 'Marseille', 'Dijon'],
    'Current': ['Nantes', 'Dijon', 'Dijon', 'Marseille', 'Lyon'],
    'Next': ['Lyone', 'Lyon', 'Lyon', 'Rennes', 'NICE']
})
incoming_df
Obviously, tracking ID 24345-S2 (green arrow) is a duplicate of 24345 (red arrow); however, it is not an exact duplicate but newer, updated location information (with history) for the parcel. How do I delete the old line 24345 and keep the new line 24345-S2 in the data set?
The length of a tracking ID can be from 4 to 20 characters, but '-S2' is always helpfully appended.
Thank you!
Edit: new solution:

# extract the base IDs of the '-S2' duplicates
duplicates = incoming_df['Tracking ID'].str.extract(r'(.+)-S2').dropna()
# remove the older entries if present
incoming_df = incoming_df[~incoming_df['Tracking ID'].isin(duplicates[0].unique())]
If the 1234-S2 entry always appears lower in the DataFrame than the 1234 entry, you could do something like:
# remove the suffix from all entries
incoming_df['Tracking ID'] = incoming_df['Tracking ID'].apply(lambda x: x.split('-')[0])
# keep only the last entry of the duplicates
incoming_df = incoming_df.drop_duplicates(subset='Tracking ID', keep='last')
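As a self-contained check of the extract-and-drop idea on the question's sample data (a sketch; only the '-S2' suffix behaviour described in the question is assumed):

```python
import pandas as pd

incoming_df = pd.DataFrame({
    'Tracking ID': ['4845', '24345', '8436474', '457453', '24345-S2'],
    'Previous': ['Paris', 'Lille', 'Paris', 'Marseille', 'Dijon'],
    'Current': ['Nantes', 'Dijon', 'Dijon', 'Marseille', 'Lyon'],
    'Next': ['Lyone', 'Lyon', 'Lyon', 'Rennes', 'NICE']
})

# base IDs that have an updated '-S2' counterpart
duplicates = incoming_df['Tracking ID'].str.extract(r'(.+)-S2').dropna()
# drop the superseded originals; keep the '-S2' rows untouched
incoming_df = incoming_df[~incoming_df['Tracking ID'].isin(duplicates[0].unique())]
print(incoming_df['Tracking ID'].tolist())
# → ['4845', '8436474', '457453', '24345-S2']
```

Unlike the strip-and-keep-last approach, this preserves the '-S2' suffix in the surviving row.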

Extract data from specific format in Pandas DF

I have raw data in CSV format which looks like this:

product-name     brand-name    rating
["Whole Wheat"]  ["bb Royal"]  ["4.1"]

Expected output:

product-name  brand-name  rating
Whole Wheat   bb Royal    4.1

I want this to apply to every entry in my dataset; I have 10,000 rows of data. How can I do this using pandas? Can it be done with regular expressions? I'm not sure how to approach it.
Thank you.
Edit 1:
My data looks something like this:

import pandas as pd

df = {
    'product-name': [["'Whole Wheat'"], ["'Milk'"]],
    'brand-name': [["'bb Royal'"], ["'XYZ'"]],
    'rating': [["'4.1'"], ["'4.0'"]]
}
df_p = pd.DataFrame(data=df)

It outputs values like this: ["bb Royal"]
PS: Apologies for my code. I am quite new to programming and also to this community. I really appreciate your help here :)
IIUC, select the first value of each list:
df = df.apply(lambda x: x.str[0])
Or, if the values are strings rather than lists, strip the brackets with a regex:
df = df.replace(r'[\[\]]', '', regex=True)
You can also use the explode function:
df = df.apply(pd.Series.explode)
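A runnable sketch of the first approach, using a simplified version of the question's data (plain lists, without the extra quoting):

```python
import pandas as pd

df_p = pd.DataFrame({
    'product-name': [['Whole Wheat'], ['Milk']],
    'brand-name': [['bb Royal'], ['XYZ']],
    'rating': [['4.1'], ['4.0']]
})

# .str[0] indexes into each cell, so it works on list values as well as strings
out = df_p.apply(lambda x: x.str[0])
print(out.to_dict('list'))
# → {'product-name': ['Whole Wheat', 'Milk'], 'brand-name': ['bb Royal', 'XYZ'], 'rating': ['4.1', '4.0']}
```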

Add keys from dicts (in column) to new column

I have a DataFrame with a 'budgetYearMap' column, which holds 1-3 key-value pairs per record. I'm a bit stuck as to how to make a new column containing only the keys of the 'budgetYearMap' column.
Sample data below:
df_sample = pd.DataFrame({
    'identifier': ['BBI-2016-D02', 'BBI-2016-D03', 'BBI-2016-D04', 'BBI-2016-D05', 'BBI-2016-D06'],
    'callIdentifier': ['H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016'],
    'budgetYearMap': [{'0': 188650000}, {'2017': 188650000}, {'2015': 188650000}, {'2014': 188650000},
                      {'2020': 188650000, '2014': 188650000, '2012': 188650000}]
})
First I tried to extract the keys by position, make a list out of them, and add the list to the DataFrame. As some records contain multiple keys (I then found out), this approach failed:

all_keys = [i for s in [list(d.keys()) for d in df_sample.budgetYearMap] for i in s]
df_sample['budgetYear'] = all_keys
My problem is that extracting the keys by "name" wouldn't work either, given that the key names are variable and I do not know the set of years in advance. The data set will keep growing: a key can be either 0 or a year in the 2000 range now, but in the future more years will be added.
My desired output would be:
df_output = pd.DataFrame({
    'identifier': ['BBI-2016-D02', 'BBI-2016-D03', 'BBI-2016-D04', 'BBI-2016-D05', 'BBI-2016-D06'],
    'callIdentifier': ['H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016'],
    'Year': ['0', '2017', '2015', '2014', '2020, 2014, 2012']
})
Any idea how I should approach this?
A perfect pipeline use-case.

df = (
    df_sample
    .assign(Year=df_sample['budgetYearMap'].apply(lambda s: list(s.keys())))
    .drop(columns=['budgetYearMap'])
)

.assign creates a new column by applying the lambda function to the 'budgetYearMap' Series; the lambda returns each dictionary's keys as a list. If you prefer a string (as in your desired output), simply replace the lambda function with:
lambda s: ', '.join(list(s.keys()))
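A self-contained sketch of the pipeline with the string variant, using two of the sample rows (Python 3.7+ dicts preserve insertion order, so the joined keys come out in the order shown):

```python
import pandas as pd

df_sample = pd.DataFrame({
    'identifier': ['BBI-2016-D02', 'BBI-2016-D06'],
    'budgetYearMap': [{'0': 188650000},
                      {'2020': 188650000, '2014': 188650000, '2012': 188650000}]
})

# join each dict's keys into a comma-separated string, as in the desired output
df = (
    df_sample
    .assign(Year=df_sample['budgetYearMap'].apply(lambda s: ', '.join(s.keys())))
    .drop(columns=['budgetYearMap'])
)
print(df['Year'].tolist())
# → ['0', '2020, 2014, 2012']
```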

How to split text into columns with multiple-character delimiters (<=>, <br>)?

I have a pd.DataFrame with a cell containing lots of information, separated by some custom delimiters. I want to split this information into separate columns. A sample cell looks like this:
price<=>price<br>price<=>3100<br>price[currency]<=>PLN<br>rent<=>price<br>rent<=>600<br>rent[currency]<=>PLN<br>deposit<=>price<br>deposit<=><br>deposit[currency]<=><br>m<=>100<br>rooms_num<=>3<br>building_type<=>tenement<br>floor_no<=>floor_2<br>building_floors_num<=>4<br>building_material<=>brick<br>windows_type<=>plastic<br>heating<=>gas<br>build_year<=>1915<br>construction_status<=>ready_to_use<br>free_from<=><br>rent_to_students<=><br>equipment_types<=><br>security_types<=><br>media_types<=>cable-television<->internet<->phone<br>extras_types<=>balcony<->basement<->separate_kitchen
You can notice that at the end of this example there are also '<->' separators, separating some features within one column. I am OK with keeping those inside one column for now.
So my DataFrame looks somewhat like this:

   A                                        B
0  1   price<=>price<br>price<=>3100<br>(...)
1  2  price<=>price<br>price<=>54000<br>(...)
2  3  price<=>price<br>price<=>135600<br>(...)
So the pattern I can see is that:
column names sit between <br> and <=>
values sit between <=> and <br>
Is there any smooth way to do this in Python? Ideally, I would like a solution that splits and puts all values into columns; I could then do the column names manually.
The desired output would be like this:

   A   price price[currency]  rent (...)
0  1    3100             PLN   600 (...)
1  2   54000             CZK  1000 (...)
2  3  135600             EUR  8000 (...)
Use the str.split() method to split the data on <br>, then split the resulting chunks on <=>:

str_ = 'price<=>price<br>price<=>3100<br>price[currency]<=>PLN<br>rent<=>price<br>rent<=>600<br>rent[currency]<=>PLN<br>deposit<=>price<br>deposit<=><br>deposit[currency]<=><br>m<=>100<br>rooms_num<=>3<br>building_type<=>tenement<br>floor_no<=>floor_2<br>building_floors_num<=>4<br>building_material<=>brick<br>windows_type<=>plastic<br>heating<=>gas<br>build_year<=>1915<br>construction_status<=>ready_to_use<br>free_from<=><br>rent_to_students<=><br>equipment_types<=><br>security_types<=><br>media_types<=>cable-television<->internet<->phone<br>extras_types<=>balcony<->basement<->separate_kitchen'

# list of strings that each look like "<column><=><value>"
row_list = str_.split('<br>')
# split each string on "<=>" and save the resulting column/value pairs in a new list
row_cleaned = [row.split('<=>') for row in row_list]
# unzip the list of column/value pairs into a column tuple and a value tuple
column_list, vals_list = zip(*row_cleaned)
print(column_list)
print(vals_list)

column_list:
('price', 'price', 'price[currency]', 'rent', 'rent', 'rent[currency]', 'deposit', 'deposit', 'deposit[currency]', 'm', 'rooms_num', 'building_type', 'floor_no', 'building_floors_num', 'building_material', 'windows_type', 'heating', 'build_year', 'construction_status', 'free_from', 'rent_to_students', 'equipment_types', 'security_types', 'media_types', 'extras_types')
vals_list:
('price', '3100', 'PLN', 'price', '600', 'PLN', 'price', '', '', '100', '3', 'tenement', 'floor_2', '4', 'brick', 'plastic', 'gas', '1915', 'ready_to_use', '', '', '', '', 'cable-television<->internet<->phone', 'balcony<->basement<->separate_kitchen')
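To go from those column/value pairs to the desired one-row-per-listing DataFrame, one possible sketch (my own continuation, not part of the answer above; note that a dict keeps only the last value for a repeated key, which conveniently replaces the 'price<=>price' header pair with the actual number):

```python
import pandas as pd

def parse_cell(cell):
    # "<column><=><value>" chunks separated by "<br>";
    # repeated keys keep the last value seen
    pairs = (chunk.split('<=>') for chunk in cell.split('<br>'))
    return {k: v for k, v in pairs}

df = pd.DataFrame({
    'A': [1, 2],
    'B': ['price<=>price<br>price<=>3100<br>price[currency]<=>PLN',
          'price<=>price<br>price<=>54000<br>price[currency]<=>CZK']
})

# parse each cell into a dict, expand the dicts into columns, and re-attach 'A'
out = df[['A']].join(df['B'].apply(parse_cell).apply(pd.Series))
print(out.to_dict('list'))
# → {'A': [1, 2], 'price': ['3100', '54000'], 'price[currency]': ['PLN', 'CZK']}
```

This assumes no value itself contains '<=>'; the '<->'-separated features simply stay inside one column, as the question allows.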
