How to split text into columns with multiple-character delimiters (<=>, <br>)? - python

I have a pd.dataframe with a cell containing lots of information, separated by some custom delimiters. I want to split this information into separate columns. Sample cell looks like this:
price<=>price<br>price<=>3100<br>price[currency]<=>PLN<br>rent<=>price<br>rent<=>600<br>rent[currency]<=>PLN<br>deposit<=>price<br>deposit<=><br>deposit[currency]<=><br>m<=>100<br>rooms_num<=>3<br>building_type<=>tenement<br>floor_no<=>floor_2<br>building_floors_num<=>4<br>building_material<=>brick<br>windows_type<=>plastic<br>heating<=>gas<br>build_year<=>1915<br>construction_status<=>ready_to_use<br>free_from<=><br>rent_to_students<=><br>equipment_types<=><br>security_types<=><br>media_types<=>cable-television<->internet<->phone<br>extras_types<=>balcony<->basement<->separate_kitchen
You can notice that at the end of this example there are also '<->' separators, separating some features within one column. I am ok with keeping them inside one column for now.
So my Dataframe looks somewhat like this:
A B
0 1 price<=>price<br>price<=>3100<br>(...)
1 2 price<=>price<br>price<=>54000<br>(...)
2 3 price<=>price<br>price<=>135600<br>(...)
So the pattern I can see is that:
column names are in between: '<br>' and '<=>'
values are in between: '<=>' and '<br>'
Is there any smooth way to do this in python? Ideally, I would like to have a solution that splits and puts all values into columns. I could do the column names manually then.
The desired output would be like this:
A price price[currency] rent (...)
0 1 3100 PLN 600 (...)
1 2 54000 CZK 1000 (...)
2 3 135600 EUR 8000 (...)

Use the str.split() method to split the data on <br>, then split each chunk on <=>:
str_ = 'price<=>price<br>price<=>3100<br>price[currency]<=>PLN<br>rent<=>price<br>rent<=>600<br>rent[currency]<=>PLN<br>deposit<=>price<br>deposit<=><br>deposit[currency]<=><br>m<=>100<br>rooms_num<=>3<br>building_type<=>tenement<br>floor_no<=>floor_2<br>building_floors_num<=>4<br>building_material<=>brick<br>windows_type<=>plastic<br>heating<=>gas<br>build_year<=>1915<br>construction_status<=>ready_to_use<br>free_from<=><br>rent_to_students<=><br>equipment_types<=><br>security_types<=><br>media_types<=>cable-television<->internet<->phone<br>extras_types<=>balcony<->basement<->separate_kitchen'
#list of str that looks like "<column><=><value>"
row_list = str_.split('<br>')
#split the string on "<=>" and save the resulting column value pair in a new list
row_cleaned = [row.split('<=>') for row in row_list]
#convert the list of column value pairs to a column list and val list
column_list, vals_list = zip(*row_cleaned)
print(column_list)
print(vals_list)
column_list:
('price', 'price', 'price[currency]', 'rent', 'rent', 'rent[currency]', 'deposit', 'deposit', 'deposit[currency]', 'm', 'rooms_num', 'building_type', 'floor_no', 'building_floors_num', 'building_material', 'windows_type', 'heating', 'build_year', 'construction_status', 'free_from', 'rent_to_students', 'equipment_types', 'security_types', 'media_types', 'extras_types')
vals_list:
('price', '3100', 'PLN', 'price', '600', 'PLN', 'price', '', '', '100', '3', 'tenement', 'floor_2', '4', 'brick', 'plastic', 'gas', '1915', 'ready_to_use', '', '', '', '', 'cable-television<->internet<->phone', 'balcony<->basement<->separate_kitchen')
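To go from these two tuples to the column layout asked for in the question, one option is to parse every cell into a dictionary and let pandas build the columns. The following is only a minimal sketch: the helper name parse_cell and the shortened two-row sample are assumptions made for illustration, and it assumes that when a key repeats (as with price<=>price followed by price<=>3100) the last value is the one to keep.

```python
import pandas as pd


def parse_cell(cell):
    """Turn 'key<=>value<br>key<=>value' into a dict.

    Hypothetical helper: a repeated key is overwritten by its last value.
    """
    record = {}
    for chunk in cell.split('<br>'):
        key, _, value = chunk.partition('<=>')
        record[key] = value
    return record


# Shortened sample data (assumed), shaped like the question's frame
df = pd.DataFrame({
    'A': [1, 2],
    'B': [
        'price<=>price<br>price<=>3100<br>price[currency]<=>PLN<br>rent<=>price<br>rent<=>600',
        'price<=>price<br>price<=>54000<br>price[currency]<=>CZK<br>rent<=>price<br>rent<=>1000',
    ],
})

# One dict per row becomes one row of the new frame; keys become column names
parsed = pd.DataFrame(df['B'].map(parse_cell).tolist(), index=df.index)
result = pd.concat([df[['A']], parsed], axis=1)
print(result)
```

Because the keys become the column names, there is no need to type the headers by hand. Values stay strings, so numeric columns would still need a conversion step afterwards.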

Related

Returns a dataframe with a list of files containing the word

I have a dataframe:
business049.txt [bmw, cash, fuel, mini, product, less, mini]
business470.txt [saudi, investor, pick, savoy, london, famou]
business075.txt [eu, minist, mull, jet, fuel, tax, european]
business101.txt [australia, rate, australia, rais, benchmark]
business060.txt [insur, boss, plead, guilti, anoth, us, insur]
Therefore, I would like the output to include a column of words and a column of filenames that contain it. It should be like:
bmw [business049.txt,business055.txt]
australia [business101.txt,business141.txt]
Thank you
This is quite possibly not the most efficient/best way to do this, but here you go:
import pandas as pd

# Create DataFrame from question
df = pd.DataFrame({
    'txt_file': ['business049.txt',
                 'business470.txt',
                 'business075.txt',
                 'business101.txt',
                 'business060.txt',
                 ],
    'words': [
        ['bmw', 'cash', 'fuel', 'mini', 'product', 'less', 'mini'],
        ['saudi', 'investor', 'pick', 'savoy', 'london', 'famou'],
        ['eu', 'minist', 'mull', 'jet', 'fuel', 'tax', 'european'],
        ['australia', 'rate', 'australia', 'rais', 'benchmark'],
        ['insur', 'boss', 'plead', 'guilti', 'anoth', 'us', 'insur'],
    ]
})
# Get all unique words in a list
word_list = list(set(df['words'].explode()))
# Link txt files to unique words
# Note: list of txt files is one string comma separated to ensure single column in resulting DataFrame
word_dict = {
    unique_word: [', '.join(df[df['words'].apply(lambda list_of_words: unique_word in list_of_words)]['txt_file'])]
    for unique_word in word_list
}
# Create DataFrame from dictionary (transpose to have words as row index).
words_in_files = pd.DataFrame(word_dict).transpose()
The dictionary word_dict might already be exactly what you need instead of holding on to a DataFrame just for the sake of using a DataFrame. If that is the case, remove the ', '.join() part from the dictionary creation, because it doesn't matter that the values of your dict are unequal in length.
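As a possible alternative that avoids scanning the whole frame once per unique word, the same word-to-files link can be sketched with explode and groupby. The two-row sample here is a shortened, assumed version of the question's data.

```python
import pandas as pd

# Shortened sample (assumed) with a word repeated inside one file
df = pd.DataFrame({
    'txt_file': ['business049.txt', 'business075.txt'],
    'words': [['bmw', 'fuel', 'mini', 'mini'], ['eu', 'jet', 'fuel']],
})

words_in_files = (
    df.explode('words')            # one row per (file, word) pair
      .drop_duplicates()           # a word repeated within a file counts once
      .groupby('words')['txt_file']
      .agg(list)                   # collect the file names for each word
)
print(words_in_files)
```

The result is a Series indexed by word whose values are lists of file names, which is close to the layout asked for in the question.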

Python: how to identify common elements in lists from two dataframes' series

Using Pandas, I have two data sets stored in two separate dataframes. Each dataframe is composed of two series.
The first dataframe has a series called 'name', the second series is a list of strings. It looks something like this:
name attributes
0 John [ABC, DEF, GHI, JKL, MNO, PQR, STU]
1 Mike [EUD, DBS, QMD, ABC, GHI]
2 Jane [JKL, EJD, MDE, MNO, DEF, ABC]
3 Kevin [FHE, EUD, GHI, MNO, ABC, AUE, HSG, PEO]
4 Stefanie [STU, EJD, DUE]
The second dataframe is similar, with the first series being 'username':
username attr
0 username_1 [DHD, EOA, AUE, CHE, ABC, PQR, QJF]
1 username_2 [ABC, EKR, ADT, GHI, JKL, EJD, MNO, MDE]
2 username_3 [DSB, AOD, DEF, MNO, DEF, ABC, TAE]
3 username_4 [DJH, EUD, GHI, MNO, ABC, FHE]
4 username_5 [CHQ, ELT, ABC, DEF, GHI]
What I'm trying to achieve is to compare the attributes (second series) of each dataframe to see which names and usernames share the most attributes.
For example, username_4 has 5 out of 6 attributes matching those of Kevin's.
I thought of looping over one of the attributes series and checking for a match in each row of the other series, but couldn't loop effectively (maybe because my lists don't have quotation marks around the strings?).
I don't really know what possibilities exist to compare those two series and end up with a result as mentioned above (username_4 has 5 out of 6 attributes matching those of Kevin's).
What would be the possible approach(es) here?
You could try a method like below:
# Import pandas library
import pandas as pd
# Create our data frames
data1 = [['John', ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU']], ['Mike', ['EUD', 'DBS', 'QMD', 'ABC', 'GHI']],
['Jane', ['JKL', 'EJD', 'MDE', 'MNO', 'DEF', 'ABC']], ['Kevin', ['FHE', 'EUD', 'GHI', 'MNO', 'ABC', 'AUE', 'HSG', 'PEO']],
['Stefanie', ['STU', 'EJD', 'DUE']]]
data2 = [['username_1', ['DHD', 'EOA', 'AUE', 'CHE', 'ABC', 'PQR', 'QJF']], ['username_2', ['ABC', 'EKR', 'ADT', 'GHI', 'JKL', 'EJD', 'MNO', 'MDE']],
['username_3', ['DSB', 'AOD', 'DEF', 'MNO', 'DEF', 'ABC', 'TAE']], ['username_4', ['DJH', 'EUD', 'GHI', 'MNO', 'ABC', 'FHE']],
['username_5', ['CHQ', 'ELT', 'ABC', 'DEF', 'GHI']]]
# Create the pandas DataFrames with column name is provided explicitly
df1 = pd.DataFrame(data1, columns=['name', 'attributes'])
df2 = pd.DataFrame(data2, columns=['username', 'attr'])
# Create helper function to compare our two data frames
def func(inputDataFrame2, inputDataFrame1):
    outputDictionary = {}  # Set a dictionary for our output
    for i, r in inputDataFrame2.iterrows():  # Loop over rows in second data frame
        dictBuilder = {}
        for index, row in inputDataFrame1.iterrows():  # Loop over rows in first data frame
            name = row['name']
            dictBuilder[name] = len([w for w in r['attr'] if w in row['attributes']])  # Count items of r['attr'] that also appear in row['attributes']
        maxKey = max(dictBuilder, key=dictBuilder.get)  # Get the name with the highest match count
        outputDictionary[r['username']] = [maxKey, dictBuilder[maxKey]]  # Add name and count of attribute matches to dictionary
    print(outputDictionary)  # Debug print statement
    return outputDictionary  # Return our output dictionary here for further processing
a = func(df2, df1)
That should yield an output like below:
{'username_1': ['John', 2], 'username_2': ['Jane', 5], 'username_3': ['John', 4], 'username_4': ['Kevin', 5], 'username_5': ['John', 3]}
Each item in the returned dictionary has:
Dictionary key value equal to the username from the second data frame
Dictionary value equal to a list, containing the name and count with the most matches as compared to our first data frame
Note that this method could be optimized in how it loops over each row in the two data frames - The thread below describes a few different ways to process rows in data frames:
How to iterate over rows in a DataFrame in Pandas
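One detail worth noting about the loop above: a repeated attribute (such as DEF in username_3) is counted once per occurrence. If each shared attribute should count only once, a set intersection can be used instead. The following is a rough sketch under that assumption, with a shortened, made-up sample and hypothetical names (pairs, best, shared).

```python
import pandas as pd

# Shortened sample (assumed); username_3 contains DEF twice
df1 = pd.DataFrame({
    'name': ['John', 'Kevin'],
    'attributes': [['ABC', 'DEF', 'GHI', 'MNO'],
                   ['FHE', 'EUD', 'GHI', 'MNO', 'ABC']],
})
df2 = pd.DataFrame({
    'username': ['username_3', 'username_4'],
    'attr': [['DEF', 'MNO', 'DEF', 'ABC'],
             ['DJH', 'EUD', 'GHI', 'MNO', 'ABC', 'FHE']],
})

# Count distinct shared attributes for every (username, name) pair
rows = []
for username, attr in zip(df2['username'], df2['attr']):
    for name, attributes in zip(df1['name'], df1['attributes']):
        rows.append((username, name, len(set(attr) & set(attributes))))
pairs = pd.DataFrame(rows, columns=['username', 'name', 'shared'])

# Keep the best-matching name per username (first one wins on ties)
best = pairs.loc[pairs.groupby('username')['shared'].idxmax()]
print(best)
```

Keeping the full pairs frame also makes it easy to inspect runners-up, not only the single best match.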

Splitting Columns that contains delimiters using Python

I have an incoming file with 100+ columns, where some columns contain comma-separated values.
I have to convert those delimited columns into multiple columns with the same column header plus a sequence number.
For example, if my input is as below:
name,age,interests,sports,gender,year
aaa,44,"movies,poker","tennis,baseball",M,2000
bbb,23,"movies","hockey,baseball",F,2018
The output should be as below. We should not hardcode the column names; whichever column contains a comma should be split.
name,age,interests_1,interests_2,sports_1,sports_2,gender,year
aaa, 44,movies, poker, tennis, baseball,M, 2000
bbb, 23,movies, hockey, baseball,F, 2018
Use these as the columns of the file you are going to create:
st = '''name,age,interests,sports,gender,year aaa,44,"movies,poker","tennis,baseball",M,2000 bbb,23,"movies","hockey,baseball",F,2018'''
columns = st.split(',')
>>columns
['name',
'age',
'interests',
'sports',
'gender',
'year aaa',
'44',
'"movies',
'poker"',
'"tennis',
'baseball"',
'M',
'2000 bbb',
'23',
'"movies"',
'"hockey',
'baseball"',
'F',
'2018']
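Splitting the raw text on every comma, as above, also breaks apart the quoted fields. A different approach, sketched here as an assumption rather than as the answer's method, is to let pandas.read_csv handle the quoting and then expand any column whose values contain a comma. The variable names raw, pieces and result are made up for this illustration.

```python
import io

import pandas as pd

raw = '''name,age,interests,sports,gender,year
aaa,44,"movies,poker","tennis,baseball",M,2000
bbb,23,"movies","hockey,baseball",F,2018
'''

# read_csv respects the double quotes, so "movies,poker" stays one field
df = pd.read_csv(io.StringIO(raw), dtype=str)

pieces = []
for col in df.columns:
    if df[col].str.contains(',', regex=False, na=False).any():
        # Expand into col_1, col_2, ... without hardcoding column names
        parts = df[col].str.split(',', expand=True)
        parts.columns = [f'{col}_{i + 1}' for i in range(parts.shape[1])]
        pieces.append(parts)
    else:
        pieces.append(df[[col]])

result = pd.concat(pieces, axis=1)
print(result)
```

Rows with fewer items than the widest row get a missing value in the extra columns, as happens for interests_2 in the second row.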

how to create new column on the basis of word matched [duplicate]

How do I add a new column based on a searched item? For example, if a dataframe column contains BX, the new column should contain BOX. There are more than 30 such short forms.
I think a dictionary would be the best option for the replacement:
mapping= {
'BX': 'BOX',
'CS': 'CASE',
'EA': 'EACH',
'PK': 'PACK',
'None': None
}
import pandas as pd
lst = ['BX', 'EA', 'EA', 'PK', 'BG','CS']
df = pd.DataFrame(lst)
df.map(mapping)
Somehow I am not able to do it.
You can do this as follows.
# first define a mapping
mapping= {
'BX': 'BOX',
'CS': 'CASE',
'EA': 'EACH',
'PK': 'PACK',
'None': None
}
# then apply it with map (assuming your abbreviations are
# stored in column short and the result should be stored
# in long)
df['long']=df['short'].map(mapping)
With the following test dataframe
lst = ['BX', 'EA', 'EA', 'PK', 'BG','CS']
df = pd.DataFrame(dict(short=lst))
df['long'] = df['short'].map(mapping)
df
It outputs:
Out[447]:
short long
0 BX BOX
1 EA EACH
2 EA EACH
3 PK PACK
4 BG NaN
5 CS CASE
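As the BG row shows, abbreviations missing from the dictionary come back as NaN. If keeping the original abbreviation is preferable in that case (an assumption, since the question does not say), one possible sketch is to fill the gaps from the source column:

```python
import pandas as pd

mapping = {'BX': 'BOX', 'CS': 'CASE', 'EA': 'EACH', 'PK': 'PACK'}
df = pd.DataFrame({'short': ['BX', 'EA', 'EA', 'PK', 'BG', 'CS']})

# map() yields NaN for unknown keys; fillna() falls back to the original value
df['long'] = df['short'].map(mapping).fillna(df['short'])
print(df)
```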

Trim each column values at pandas

I am working with .xls files. After importing the data into a data frame with pandas, I need to trim the values. I have a lot of columns, and each value in a column starts with xxx: or yyy:,
for example:
xxx:abc yyy:def \n
xxx:def yyy:ghi \n
xxx:ghi yyy:jkl \n
...
I need to trim the xxx: and yyy: prefixes in each column. I researched and tried some solutions, but they didn't work. How can I trim them efficiently? Thanks in advance.
The unnecessary characters don't have a fixed length; I just know what they look like, similar to stop words. For example:
['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB', ...]
['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB', ...]
I want the new dataset to look like:
['Apple', 'iPhone', '2018', '128GB', ...]
['Samsung', 'Note', '2017', '64GB', ...]
So I want to trim ('Comp:', 'Product:', 'Year:', ...) stop words for each column.
You can use pd.Series.str.split for this:
import pandas as pd
df = pd.DataFrame([['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB'],
['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB']],
columns=['Comp', 'Product', 'Year', 'Memory'])
for col in ['Comp', 'Product', 'Year']:
    df[col] = df[col].str.split(':').str.get(1)
# Comp Product Year Memory
# 0 Apple iPhone 2018 128GB
# 1 Samsung Note 2017 64GB
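With many columns, listing the prefixed ones by hand may be impractical. A possible variation, assuming every prefix ends at the first colon and that values without a colon should stay untouched, is to strip the prefix with a regular expression across all columns:

```python
import pandas as pd

df = pd.DataFrame([['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB'],
                   ['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB']],
                  columns=['Comp', 'Product', 'Year', 'Memory'])

# Remove everything up to and including the first colon in each cell;
# cells without a colon (such as '128GB') are left as they are
cleaned = df.apply(lambda col: col.str.replace(r'^[^:]*:', '', regex=True))
print(cleaned)
```

This assumes all columns hold strings; numeric columns would need to be skipped or converted first.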
