Splitting columns that contain delimiters using Python

I have an incoming file with 100+ columns, where some columns contain comma-separated values.
I have to convert those delimited columns into multiple columns with the same column header, numbered in sequence.
For example, if my input is:
name,age,interests,sports,gender,year
aaa,44,"movies,poker","tennis,baseball",M,2000
bbb,23,"movies","hockey,baseball",F,2018
the output should be as below. We should not hardcode the column names; whichever column contains a , should be split.
name,age,interests_1,interests_2,sports_1,sports_2,gender,year
aaa,44,movies,poker,tennis,baseball,M,2000
bbb,23,movies,,hockey,baseball,F,2018

Note that a naive str.split(',') on the raw text does not work: it breaks the quoted fields and, with the rows run together on one line, fuses values like 'year aaa':
st = '''name,age,interests,sports,gender,year aaa,44,"movies,poker","tennis,baseball",M,2000 bbb,23,"movies","hockey,baseball",F,2018'''
columns = st.split(',')
>>> columns
['name',
'age',
'interests',
'sports',
'gender',
'year aaa',
'44',
'"movies',
'poker"',
'"tennis',
'baseball"',
'M',
'2000 bbb',
'23',
'"movies"',
'"hockey',
'baseball"',
'F',
'2018']
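
A more robust route is to let a CSV parser handle the quoting and then split only the columns that still contain commas. A minimal sketch with pandas (the interests_1, interests_2, ... naming follows the expected output above):

import io
import pandas as pd

csv_text = '''name,age,interests,sports,gender,year
aaa,44,"movies,poker","tennis,baseball",M,2000
bbb,23,"movies","hockey,baseball",F,2018'''

# read_csv respects the quotes, so "movies,poker" stays a single field
df = pd.read_csv(io.StringIO(csv_text))

# split every string column that still contains a comma into numbered columns
parts = []
for col in df.columns:
    if df[col].dtype == object and df[col].str.contains(',').any():
        split = df[col].str.split(',', expand=True)
        split.columns = [f'{col}_{i + 1}' for i in range(split.shape[1])]
        parts.append(split)
    else:
        parts.append(df[[col]])

result = pd.concat(parts, axis=1)
print(result)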


Returns a dataframe with a list of files containing the word

I have a dataframe:
business049.txt [bmw, cash, fuel, mini, product, less, mini]
business470.txt [saudi, investor, pick, savoy, london, famou]
business075.txt [eu, minist, mull, jet, fuel, tax, european]
business101.txt [australia, rate, australia, rais, benchmark]
business060.txt [insur, boss, plead, guilti, anoth, us, insur]
Therefore, I would like the output to have a column of words and a column of the filenames that contain each word. It should be like:
bmw [business049.txt,business055.txt]
australia [business101.txt,business141.txt]
Thank you
This is quite possibly not the most efficient/best way to do this, but here you go:
import pandas as pd

# Create DataFrame from question
df = pd.DataFrame({
    'txt_file': [
        'business049.txt',
        'business470.txt',
        'business075.txt',
        'business101.txt',
        'business060.txt',
    ],
    'words': [
        ['bmw', 'cash', 'fuel', 'mini', 'product', 'less', 'mini'],
        ['saudi', 'investor', 'pick', 'savoy', 'london', 'famou'],
        ['eu', 'minist', 'mull', 'jet', 'fuel', 'tax', 'european'],
        ['australia', 'rate', 'australia', 'rais', 'benchmark'],
        ['insur', 'boss', 'plead', 'guilti', 'anoth', 'us', 'insur'],
    ]
})

# Get all unique words in a list
word_list = list(set(df['words'].explode()))

# Link txt files to unique words
# Note: the list of txt files is one comma-separated string to ensure a single
# column in the resulting DataFrame
word_dict = {
    unique_word: [', '.join(
        df[df['words'].apply(lambda list_of_words: unique_word in list_of_words)]['txt_file']
    )]
    for unique_word in word_list
}

# Create DataFrame from the dictionary (transpose to have words as the row index)
words_in_files = pd.DataFrame(word_dict).transpose()
The dictionary word_dict might already be exactly what you need instead of holding on to a DataFrame just for the sake of using a DataFrame. If that is the case, remove the ', '.join() part from the dictionary creation, because it doesn't matter that the values of your dict are unequal in length.
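
A shorter alternative, as a sketch assuming pandas 0.25+ (for explode), is to explode the word lists and group the file names by word:

exploded = df.explode('words')
words_to_files = exploded.groupby('words')['txt_file'].apply(lambda s: sorted(set(s)))
print(words_to_files.loc['fuel'])  # ['business049.txt', 'business075.txt']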

I want to check for band skipping in antenna machine test data by comparing two CSV files using Python; one is the tested file and the other one is a reference

Comparative view
Firstly, the program opens the first CSV reference file and reads all antenna band data, then opens the tested phone file and compares antenna bands to check whether any band was skipped in the tested phone data.
You can use pandas to do this. The following code returns records where column 3 differs between the two data sets (COL_3_x is column 3 from the left-hand table and COL_3_y is column 3 from the right-hand table).
import pandas as pd

# load LH table
DF1 = pd.DataFrame({
    'PHONE_TEST_TYPE': ['WCDMA_B1_CouplingTest', 'WCDMA_B2_CouplingTest', 'WCDMA_B3_CouplingTest', 'WCDMA_B5_CouplingTest', 'WCDMA_B8_CouplingTest'],
    'COL_2': ['', '', '', '20', '20'],
    'COL_3': ['skipped', 'skipped', 'skipped', 22.9621, 23.0011],
    'COL_4': ['', '', '', 26, 26],
    'COL_5': ['', '', '', '406ms', '453ms'],
})

# load RH table
DF2 = pd.DataFrame({
    'PHONE_TEST_TYPE': ['WCDMA_B1_CouplingTest', 'WCDMA_B2_CouplingTest', 'WCDMA_B3_CouplingTest', 'WCDMA_B5_CouplingTest', 'WCDMA_B8_CouplingTest'],
    'COL_2': ['', '', '', '20', '20'],
    'COL_3': ['skipped', 'skipped', 'skipped', 'skipped', 'skipped'],
    'COL_4': ['', '', '', 26, 26],
    'COL_5': ['', '', '', '406ms', '453ms'],
})

# combine via left join; overlapping columns get _x (DF1) and _y (DF2) suffixes
DF3 = DF1.merge(DF2, on='PHONE_TEST_TYPE', how='left')

# create a filter to identify records where column 3 differs between the tables
FILT = DF3['COL_3_x'] != DF3['COL_3_y']

# present the result as a dataframe
DF3[FILT]
result: the filter keeps the two rows where the reference (LH) measured a value but the tested data (RH) shows skipped (WCDMA_B5_CouplingTest and WCDMA_B8_CouplingTest).
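
To run the same comparison against the actual files rather than inline frames, a minimal sketch (the file names here are assumptions, substitute your real paths; it assumes both files share the PHONE_TEST_TYPE key and a COL_3 column):

import pandas as pd

reference = pd.read_csv('reference.csv')  # hypothetical file name
tested = pd.read_csv('tested.csv')        # hypothetical file name

merged = reference.merge(tested, on='PHONE_TEST_TYPE', how='left')
print(merged[merged['COL_3_x'] != merged['COL_3_y']])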

How to split text into columns with multiple-character delimiters (<=>, <br>)?

I have a pd.DataFrame with a cell containing lots of information, separated by some custom delimiters. I want to split this information into separate columns. A sample cell looks like this:
price<=>price<br>price<=>3100<br>price[currency]<=>PLN<br>rent<=>price<br>rent<=>600<br>rent[currency]<=>PLN<br>deposit<=>price<br>deposit<=><br>deposit[currency]<=><br>m<=>100<br>rooms_num<=>3<br>building_type<=>tenement<br>floor_no<=>floor_2<br>building_floors_num<=>4<br>building_material<=>brick<br>windows_type<=>plastic<br>heating<=>gas<br>build_year<=>1915<br>construction_status<=>ready_to_use<br>free_from<=><br>rent_to_students<=><br>equipment_types<=><br>security_types<=><br>media_types<=>cable-television<->internet<->phone<br>extras_types<=>balcony<->basement<->separate_kitchen
You can notice that at the end of this example there are also <-> separators, separating some features within one column. I am OK with keeping those inside one column for now.
So my Dataframe looks somewhat like this:
A B
0 1 price<=>price<br>price<=>3100<br>(...)
1 2 price<=>price<br>price<=>54000<br>(...)
2 3 price<=>price<br>price<=>135600<br>(...)
So the pattern I can see is that:
column names sit between <br> and <=>
values sit between <=> and <br>
Is there any smooth way to do this in Python? Ideally, I would like a solution that splits and puts all the values into columns; I could do the column names manually then.
The desired output would be like this:
A price price[currency] rent (...)
0 1 3100 PLN 600 (...)
1 2 54000 CZK 1000 (...)
2 3 135600 EUR 8000 (...)
Use the str.split() method to split the data on <br>, then split the chunks on <=>:
str_ = 'price<=>price<br>price<=>3100<br>price[currency]<=>PLN<br>rent<=>price<br>rent<=>600<br>rent[currency]<=>PLN<br>deposit<=>price<br>deposit<=><br>deposit[currency]<=><br>m<=>100<br>rooms_num<=>3<br>building_type<=>tenement<br>floor_no<=>floor_2<br>building_floors_num<=>4<br>building_material<=>brick<br>windows_type<=>plastic<br>heating<=>gas<br>build_year<=>1915<br>construction_status<=>ready_to_use<br>free_from<=><br>rent_to_students<=><br>equipment_types<=><br>security_types<=><br>media_types<=>cable-television<->internet<->phone<br>extras_types<=>balcony<->basement<->separate_kitchen'
#list of str that looks like "<column><=><value>"
row_list = str_.split('<br>')
#split the string on "<=>" and save the resulting column value pair in a new list
row_cleaned = [row.split('<=>') for row in row_list]
#convert the list of column value pairs to a column list and val list
column_list, vals_list = zip(*row_cleaned)
print(column_list)
print(vals_list)
column_list:
('price', 'price', 'price[currency]', 'rent', 'rent', 'rent[currency]', 'deposit', 'deposit', 'deposit[currency]', 'm', 'rooms_num', 'building_type', 'floor_no', 'building_floors_num', 'building_material', 'windows_type', 'heating', 'build_year', 'construction_status', 'free_from', 'rent_to_students', 'equipment_types', 'security_types', 'media_types', 'extras_types')
vals_list:
('price', '3100', 'PLN', 'price', '600', 'PLN', 'price', '', '', '100', '3', 'tenement', 'floor_2', '4', 'brick', 'plastic', 'gas', '1915', 'ready_to_use', '', '', '', '', 'cable-television<->internet<->phone', 'balcony<->basement<->separate_kitchen')
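
To expand this over a whole DataFrame column, one option is a sketch reusing str_ from above. It assumes every row follows the same layout and that, for repeated keys such as price, the later pair holds the real value (which matches the sample, where price<=>price precedes price<=>3100):

import pandas as pd

def parse_row(cell):
    # split on <br>, then each chunk on <=>; for duplicate keys, the last pair wins
    pairs = (chunk.split('<=>') for chunk in cell.split('<br>'))
    return {key: value for key, value in pairs}

df = pd.DataFrame({'A': [1], 'B': [str_]})
expanded = df['B'].apply(parse_row).apply(pd.Series)
result = pd.concat([df[['A']], expanded], axis=1)
print(result[['A', 'price', 'price[currency]', 'rent']])
#    A price price[currency] rent
# 0  1  3100             PLN  600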

Adding a column to a pandas dataframe with values from columns of another dataframe, depending on a key from a dictionary

I have the following two dataframes:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['01/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
(perhaps it's clearer in the screenshots here: https://imgur.com/a/YNrWpR2)
The df2 is much larger than shown here - it contains columns for 100 companies. So for example, for the 10th company, the column names are: ReturnOnAssets.10, etc.
I have created a dictionary which maps the company names to the column names:
stocks = {'Microsoft':'','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7'}
and so on.
Now, what I am trying to achieve is adding a column "ReturnOnAssets" from df2 to df1, but for a specific company and for a specific date. So looking at df1, the first tweet (i.e. "text") contains the keyword "Amazon" and was posted on 02/30/2017. I now need to go to df2, find the relevant column for Amazon (i.e. "ReturnOnAssets.2"), and fetch the value for that date.
So what I expect looks like this:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon', '10.5'], ["blala Amazon", '04/28/2017', 'Amazon', 'x'], ['blabla Netflix', '06/28/2017', 'Netflix', 'x']], columns=['text', 'date', 'keyword', 'ReturnOnAssets'])
By x I mean values which were not included in the example df1 and df2.
I am fairly new to pandas and I can't wrap my head around it. I tried:
keyword = df1['keyword']
txt = 'ReturnOnAssets.'+ stocks[keyword]
df1['ReturnOnAssets'] = df2[txt]
But I don't know how to fetch the relevant date, and this also gives me an error: "'Series' objects are mutable, thus they cannot be hashed", which probably comes from the fact that I cannot just add a whole column of keywords to the text string.
I don't know how to achieve the operation I need to do, so I would appreciate help.
Here is one way to do it. It can probably be shortened, and you can add if statements to deal with missing values.
import pandas as pd
import numpy as np

df1 = pd.DataFrame([["blala Amazon", '05/28/2017', 'Amazon'], ["blala Facebook", '04/28/2017', 'Facebook'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'dates', 'keyword'])
df2 = pd.DataFrame([['06/28/2017', '3.4', '10.2'], ['05/28/2017', '3.7', '10.5'], ['04/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAsset.1', 'ReturnOnAsset.2'])

# creating myself a bigger df2 to cover all the way to Netflix
for i in range(9):
    df2['ReturnOnAsset.' + str(i)] = np.random.randint(1, 1000, df1.shape[0])

stocks = {'Microsoft': '0', 'Apple': '1', 'Amazon': '2', 'Facebook': '3',
          'Berkshire Hathaway': '4', 'Johnson & Johnson': '5',
          'JPMorgan': '6', 'Alphabet': '7', 'Netflix': '8'}

# new col where to store values
df1['ReturnOnAsset'] = np.nan

for index, row in df1.iterrows():
    colname = 'ReturnOnAsset.' + stocks[row['keyword']]
    # .loc avoids chained assignment; .iloc[0] takes the matching scalar
    df1.loc[index, 'ReturnOnAsset'] = df2.loc[df2['dates'] == row['dates'], colname].iloc[0]
Next time, please give us correct test data; I modified your dates and dictionary to match the first and second columns (the Netflix and Amazon values).
This code will work if and only if all dates from df1 are in df2 (note that in df1 the column is named date while in df2 it is named dates):
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '02/30/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['04/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
stocks = {'Microsoft':'','Apple' :'5', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Netflix':'1',
'JPMorgan' :'6', 'Alphabet': '7'}
df1["ReturnOnAssets"]= [ df2["ReturnOnAssets." + stocks[ df1[ "keyword" ][ index ] ] ][ df2.index[ df2["dates"] == df1["date"][index] ][0] ] for index in range(len(df1)) ]
df1
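
A vectorised alternative (a sketch under the same assumption that every date in df1 exists in df2, with stocks mapping each keyword to its column suffix, and starting from the original df1 before any ReturnOnAssets column is added) is to reshape df2 to long form once and merge on date plus suffix:

# melt df2 so each row is (date, column name, value)
long_df2 = df2.melt(id_vars='dates', var_name='col', value_name='ReturnOnAssets')
long_df2['suffix'] = long_df2['col'].str.split('.').str[-1]

# map each keyword to its column suffix and join
df1['suffix'] = df1['keyword'].map(stocks)
df1 = (df1.merge(long_df2[['dates', 'suffix', 'ReturnOnAssets']],
                 left_on=['date', 'suffix'], right_on=['dates', 'suffix'],
                 how='left')
          .drop(columns=['dates', 'suffix']))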

Trim each column's values in pandas

I am working with .xls files; after importing the data into a DataFrame with pandas, I need to trim the values. I have a lot of columns. Each value starts with xxx: or yyy: within its column,
for example:
xxx:abc yyy:def \n
xxx:def yyy:ghi \n
xxx:ghi yyy:jkl \n
...
I need to trim that xxx: and yyy: from each column. I researched and tried some solutions, but they didn't work. How can I trim these prefixes? I need effective code. Thanks in advance.
(The unnecessary characters don't have a fixed length; I just know what they look like, similar to stop words. For example:
['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB', ...]
['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB', ...]
I want the new dataset to look like:
['Apple', 'iPhone', '2018', '128GB', ...]
['Samsung', 'Note', '2017', '64GB', ...]
So I want to trim those "stop words" ('Comp:', 'Product:', 'Year:', ...) from each column.)
You can use pd.Series.str.split for this:
import pandas as pd

df = pd.DataFrame([['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB'],
                   ['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB']],
                  columns=['Comp', 'Product', 'Year', 'Memory'])

for col in ['Comp', 'Product', 'Year']:
    df[col] = df[col].str.split(':').str.get(1)

#       Comp Product  Year Memory
# 0    Apple  iPhone  2018  128GB
# 1  Samsung    Note  2017   64GB
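
If you'd rather not hardcode which columns carry a prefix, a sketch that strips everything up to the first ':' in every string column (assuming the prefix itself never contains a ':'; values without one, such as '128GB', pass through unchanged):

for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.split(':', n=1).str[-1]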
