Pandas convert columns of one dataframe to index in another dataframe - python

I have some text files that are in .txt format.
I'm trying to create a .csv file from them so that the .txt file names are in the index column.
I will add columns with demographic and statistical information (such as L1, Prompt, and Level) later when editing the dataframe, but I want the .txt file names in the index so that I can do some NLTK analysis.
The desired output is:
                  L1 Prompt Level
FileName
data1.txt   Japanese     P1  High
data2.txt     Korean     P1  High
data3.txt    Chinese     P1  High
data4.txt   Japanese     P2   Med
data5.txt     Korean     P2   Med
data6.txt    Chinese     P2   Med
data7.txt     Arabic     P1  High
data8.txt     German     P1  High
data9.txt    Spanish     P1  High
data10.txt    Arabic     P2   Med
data11.txt    German     P2   Med
data12.txt   Spanish     P2   Med
The code I tried is as follows:
import pandas as pd

df1 = pd.read_csv('data1.txt', names=['data1'])
df2 = pd.read_csv('data2.txt', names=['data2'])
df3 = pd.read_csv('data3.txt', names=['data3'])

result = pd.concat([df1, df2, df3], axis=1)
result.to_csv('mergedfile.txt', index=False)
but this, of course, creates columns:
data1.txt data2.txt data3.txt
0 XYZ GHI PQR
1 ABC JKL STU
2 DEF MNO VWX
XYZ and ABC are all sentences, such as, "One of the differences between my home country and the US is convenient stores." or "One difference is public transportation, everyone took public transportation in my home country, not so much in the US."
I have over 100,000 utterances in each txt file, so I don't want to put all of the data in the dataframe; if I can get just the txt file names into the index column, that would be ideal.
Ultimately, I want to export this to .csv, and then use it for further analysis.

You can just use the columns from your dataframe as the index of a new dataframe:
df1 = pd.DataFrame({'data1': ['XYZ', 'ABC', 'DEF']})
df2 = pd.DataFrame({'data2': ['GHI', 'JKL', 'MNO']})
df3 = pd.DataFrame({'data3': ['PQR', 'STU', 'VWX']})
df = pd.concat([df1, df2, df3], axis=1)
print(df)
# data1 data2 data3
# 0 XYZ GHI PQR
# 1 ABC JKL STU
# 2 DEF MNO VWX
res = pd.DataFrame(index=[k + '.txt' for k in df],
                   columns=['L1', 'Prompt', 'Level'])
print(res)
# L1 Prompt Level
# data1.txt NaN NaN NaN
# data2.txt NaN NaN NaN
# data3.txt NaN NaN NaN
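Since each file holds over 100,000 utterances, you don't need to read them at all just to build this skeleton. A minimal sketch (assuming the .txt files sit in the working directory) that builds the index straight from the filenames:

import glob
import pandas as pd

# collect the filenames without opening the files themselves
files = sorted(glob.glob('data*.txt'))
res = pd.DataFrame(index=pd.Index(files, name='FileName'),
                   columns=['L1', 'Prompt', 'Level'])
res.to_csv('mergedfile.csv')

Note that sorted() orders data10.txt before data2.txt lexically; sort numerically instead if row order matters.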

Related

Fuzzy process.extractOne giving different results

I have a data frame and I am trying to map one column's values to values present in a set.
Data frame is
Name CallType Location
ABC IN SFO
DEF OUT LHR
PQR INCOMING AMS
XYZ OUTGOING BOM
TYR A_IN DEL
OMN A_OUT DXB
I have a constant set; each CallType value should be replaced by the matching value from it:
call_types = {'IN', 'OUT'}
Desired data frame
Name CallType Location
ABC IN SFO
DEF OUT LHR
PQR IN AMS
XYZ OUT BOM
TYR IN DEL
OMN OUT DXB
I wrote the code to check the response, but process.extractOne sometimes gives IN for OUTGOING (which is wrong) and sometimes gives OUT for OUTGOING (which is right).
Here is my code:
import pandas as pd
from fuzzywuzzy import process

data = [('ABC', 'IN', 'SFO'),
        ('DEF', 'OUT', 'LHR'),
        ('PQR', 'INCOMING', 'AMS'),
        ('XYZ', 'OUTGOING', 'BOM'),
        ('TYR', 'A_IN', 'DEL'),
        ('OMN', 'A_OUT', 'DXB')]
df = pd.DataFrame(data, columns=['Name', 'CallType', 'Location'])

call_types = {'IN', 'OUT'}
df['CallType'] = df['CallType'].apply(lambda x: process.extractOne(x, list(call_types))[0])

for row_no in range(len(df)):
    row = df.iloc[row_no]
    print(row)  # Sometimes OUTGOING comes out as OUT and sometimes as IN. Shouldn't the result be consistent?
I am not sure if there is a better way. Can someone please suggest if I am missing something.
Looks like Series.str.extract is a good fit for this:
df['CallType'] = df.CallType.str.extract(r'(OUT|IN)')
print(df)
Name CallType Location
0 ABC IN SFO
1 DEF OUT LHR
2 PQR IN AMS
3 XYZ OUT BOM
4 TYR IN DEL
5 OMN OUT DXB
Or, if you want to use call_types explicitly, you can do:
df['CallType'] = df.CallType.str.extract(fr"({'|'.join(call_types)})")
# same result
A possible solution is to use difflib.get_close_matches:
import difflib

df['CallType'] = df['CallType'].apply(
    lambda x: difflib.get_close_matches(x, call_types)[0])
Output:
Name CallType Location
0 ABC IN SFO
1 DEF OUT LHR
2 PQR IN AMS
3 XYZ OUT BOM
4 TYR IN DEL
5 OMN OUT DXB
Another possible solution:
import numpy as np

df['CallType'] = np.where(df['CallType'].str.contains('OUT'), 'OUT', 'IN')
Output:
# same
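As for why the original code flips between runs: under fuzzywuzzy's default WRatio scorer, both choices can score identically against OUTGOING, since both "out" and "in" occur as substrings of "outgoing". A small check (the exact scores are my expectation from recent fuzzywuzzy versions, so treat them as an assumption):

from fuzzywuzzy import fuzz

# both candidates tie against 'OUTGOING' under the default scorer
print(fuzz.WRatio('OUTGOING', 'OUT'))  # typically 90
print(fuzz.WRatio('OUTGOING', 'IN'))   # typically 90 as well

extractOne breaks the tie by taking whichever choice comes first, and list(call_types) order varies between interpreter runs because set iteration order for strings depends on hash randomization. The answers above match on the text itself rather than on fuzzy scores, which is why they behave consistently.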

Python Text File to Data Frame with Specific Pattern

I am trying to convert a bunch of text files into a data frame using Pandas.
Each text file contains simple text which starts with two relevant information: the Number and the Register variables.
Then, the text files have some random text that should not be taken into consideration.
Last, the text files contain information such as the share number, the name of the person, birth date, address, and some additional rows that start with a lowercase letter. Each group contains this information, and the pattern is always the same: the first row of a group starts with a number (hereby id), followed by the word "SHARE".
Here is an example:
Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000
I need to transform the text into a data frame with the following output, where each group is stored in one row:
Number Register   City Id   Share         Name       Born         c         f         h         i
 01600     4314 London  1 73/1284   John Smith 1960-01-01       NaN 4222/2001 1334/2000 5774/2000
 01600     4314 London  4 58/1284 Boris Morgan 1965-01-01 4222/1988 4222/2000       NaN       NaN
My initial approach was to first import the text file and apply regular expression for each case:
import pandas as pd
import re

df = open(r'Test.txt', 'r').read()
for line in re.findall('SHARE.*', df):
    print(line)
But probably there is a better way to do it.
Any help is highly appreciated. Thanks in advance.
This can be done without regex, using list comprehensions and string splitting:
import pandas as pd
text = '''Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000'''
text = [i.strip() for i in text.splitlines()]  # create a list of lines

data = []

# extract metadata from the first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]

# create a list of the index numbers of the lines where new items start
indices = [text.index(i) for i in text if 'SHARE' in i]

# split the list by the retrieved indices to get a list of lists of items
items = [text[i:j] for i, j in zip([0] + indices, indices + [None])][1:]

for item in items:
    d = {'Number': number, 'Register': register, 'City': city,
         'Id': int(item[0].split()[0]), 'Share': item[0].split(': ')[1],
         'Name': item[1], 'Born': item[2].split()[1]}
    # split the remaining lines into [letter, value] pairs
    rows = [s.split() for s in item[3:]]
    merged_rows = []
    for r in rows:
        if len(r[0]) == 1 and r[0].isalpha():
            merged_rows.append(r)
        else:
            # continuation line: glue it onto the previous value
            merged_rows[-1][-1] = merged_rows[-1][-1] + r[0]
    d.update({name: value for name, value in merged_rows})
    data.append(d)

# load the list of dicts as a dataframe
df = pd.DataFrame(data)
Output:
  Number Register   City Id   Share         Name       Born         f         h         i         c
0  01600     4314 London  1 73/1284   John Smith 1960-01-01 4222/2001 1334/2000 5774/2000       nan
1  01600     4314 London  4 58/1284 Boris Morgan 1965-01-01 4222/2000       nan       nan 4222/1988
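Since the question mentions a bunch of text files, here is a minimal sketch of wrapping the same parsing in a function and applying it to a folder (assumptions: every file follows exactly the layout above, each lettered row is exactly two tokens, and the input folder is the working directory; enumerate replaces list.index, which can misfire on duplicate lines):

from pathlib import Path
import pandas as pd

def parse_lines(lines):
    lines = [l.strip() for l in lines]
    # metadata from the first line
    number = lines[0].split()[1]
    city = lines[0].split()[2]
    register = lines[0].split()[4]
    # enumerate is safer than list.index when lines repeat
    starts = [i for i, l in enumerate(lines) if 'SHARE' in l]
    groups = [lines[i:j] for i, j in zip(starts, starts[1:] + [None])]
    records = []
    for g in groups:
        d = {'Number': number, 'Register': register, 'City': city,
             'Id': int(g[0].split()[0]), 'Share': g[0].split(': ')[1],
             'Name': g[1], 'Born': g[2].split()[1]}
        # assumes every lettered row is exactly 'letter value'
        d.update(dict(l.split() for l in g[3:]))
        records.append(d)
    return records

rows = []
for path in sorted(Path('.').glob('*.txt')):  # hypothetical input folder
    rows.extend(parse_lines(path.read_text().splitlines()))
df = pd.DataFrame(rows)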

How to compare two data frames on similar column values and copy values into the other data frame

I need to automate the validations performed on text files. I have two text files: if a row in file 1 has a unique combination of two columns that also appears in file 2, then the extra column from file 2 needs to be written into file 1.
Text file 1 has thousands of records; text file 2 is the reference for text file 1.
This is what I have written so far. Please help me solve this.
import pandas as pd

data = pd.read_csv("C:\\Users\\hp\\Desktop\\py\\sample2.txt", delimiter=',')
df = pd.DataFrame(data)
print(df)
# uniquecal = df[['vehicle_Brought_City', 'Vehicle_Brand']]
# print(uniquecal)

data1 = pd.read_csv("C:\\Users\\hp\\Desktop\\py\\sample1.txt", delimiter=',')
df1 = pd.DataFrame(data1)
print(df1)
# uniquecal1 = df1[['vehicle_Brought_City', 'Vehicle_Brand']]
# print(uniquecal1)
How can I put the vehicle price in dataframe one and save it to text file1?
Below is my sample dataset:
File1:
fname lname vehicle_Brought_City Vehicle_Brand Vehicle_price
0 aaa xxx pune honda NaN
1 aaa yyy mumbai tvs NaN
2 aaa xxx hyd maruti NaN
3 bbb xxx pune honda NaN
4 bbb aaa mumbai tvs NaN
File2:
vehicle_Brought_City Vehicle_Brand Vehicle_price
0 pune honda 50000
1 mumbai tvs 40000
2 hyd maruti 45000
Drop the empty Vehicle_price column from the File 1 frame, then merge the two frames on the key columns:
del df['Vehicle_price']
print(df)

dd = pd.merge(df, df1, on=['vehicle_Brought_City', 'Vehicle_Brand'])
print(dd)
output:
fname lname vehicle_Brought_City Vehicle_Brand Vehicle_price
0 aaa xxx pune honda 50000
1 aaa yyy mumbai tvs 40000
2 bbb aaa mumbai tvs 40000
3 aaa xxx hyd maruti 45000
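One hedged caveat: an inner merge drops File 1 rows that have no match in File 2, and the question also asks to save the result back to a text file. A sketch (assuming, as the del above implies, that df holds File 1 and df1 holds File 2; the output path is hypothetical):

# keep every File 1 row even when File 2 has no matching price
dd = df.drop(columns=['Vehicle_price']).merge(
    df1, on=['vehicle_Brought_City', 'Vehicle_Brand'], how='left')

# write the result back out as a text file
dd.to_csv("C:\\Users\\hp\\Desktop\\py\\sample_filled.txt", index=False)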

matching rows between dataframes in pandas in python

I have two dataframes,
df1,
Names
one two three
Sri is a good player
Ravi is a mentor
Kumar is a cricketer
df2,
values
sri
NaN
sri, is
kumar,cricketer
I am trying to get the row in df1 which contains all the items in df2.
My expected output is,
values Names
sri Sri is a good player
NaN
sri, is Sri is a good player
kumar,cricketer Kumar is a cricketer
I tried df1["Names"].str.contains("|".join(df2["values"].values.tolist())), but I cannot achieve my expected output because the values contain commas (","). Please help.
Using sets
import numpy as np
import pandas as pd

s1 = df1.Names.dropna()
s1.loc[:] = [set(x.lower().split()) for x in s1.values.tolist()]
a1 = s1.values

s2 = df2['values'].dropna()
s2.loc[:] = [set(x.replace(' ', '').lower().split(',')) for x in s2.values.tolist()]
a2 = s2.values

i = np.column_stack([a1 >= a2[:, None], [True] * len(a2)]).argmax(1)

df2.assign(Names=pd.Series(
    np.append(df1.Names.values, np.nan)[i], s2.index
))
values Names
0 sri Sri is a good player
1 NaN NaN
2 sri, is Sri is a good player
3 kumar,cricketer Kumar is a cricketer
import pandas as pd

names = [
    'one two three',
    'Sri is a good player',
    'Ravi is a mentor',
    'Kumar is a cricketer'
]
values = [
    'sri',
    'NaN',
    'sri, is',
    'kumar,cricketer',
]

names = pd.Series(names)
values = pd.DataFrame(values, columns=['values'])

def foo(words):
    names_copy = names.copy()
    for word in words.split(','):
        names_copy = names_copy[names_copy.str.contains(word, case=False)]
    return names_copy.values

values['names'] = values['values'].map(foo)
values
values names
0 sri [Sri is a good player]
1 NaN []
2 sri, is [Sri is a good player]
3 kumar,cricketer [Kumar is a cricketer]
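If you want plain strings like the expected output rather than arrays, a small follow-up sketch (assuming an empty match should become NaN):

import numpy as np

# take the first matching name, or NaN when nothing matched
values['names'] = values['names'].map(lambda a: a[0] if len(a) else np.nan)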

Pandas - how to remove spaces in each column in a dataframe?

I'm trying to remove spaces, commas, apostrophes, and double quotes from the data in each column using this for loop:
for c in data.columns:
    data[c] = data[c].str.strip().replace(',', '').replace('\'', '').replace('\"', '').strip()
but I keep getting this error:
AttributeError: 'Series' object has no attribute 'strip'
data is the dataframe and was obtained from an Excel file:
xl = pd.ExcelFile('test.xlsx')
data = xl.parse(sheetname='Sheet1')
Am I missing something? I added the str accessor, but that didn't help. Is there a better way to do this?
I don't want to hard-code the column labels, like data['column label'], because the text can differ. I would like to iterate over each column and remove the characters mentioned above.
incoming data:
id city country
1 Ontario Canada
2 Calgary ' Canada'
3 'Vancouver Canada
desired output:
id city country
1 Ontario Canada
2 Calgary Canada
3 Vancouver Canada
UPDATE: using your sample DF:
In [80]: df
Out[80]:
id city country
0 1 Ontario Canada
1 2 Calgary ' Canada'
2 3 'Vancouver Canada
In [81]: df.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
Out[81]:
id city country
0 1 Ontario Canada
1 2 Calgary Canada
2 3 Vancouver Canada
OLD answer:
you can use DataFrame.replace() method:
In [75]: df.to_dict('r')
Out[75]:
[{'a': ' x,y ', 'b': 'a"b"c', 'c': 'zzz'},
{'a': "x'y'z", 'b': 'zzz', 'c': ' ,s,,'}]
In [76]: df
Out[76]:
a b c
0 x,y a"b"c zzz
1 x'y'z zzz ,s,,
In [77]: df.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
Out[77]:
a b c
0 xy abc zzz
1 xyz zzz s
r'\1' is a backreference to the first numbered capturing RegEx group
data[c] does not return a single value; it returns a Series (a whole column of data), which is why the trailing .strip() raises the AttributeError.
You can apply the strip operation to each entire column with df.apply, as sketched below.
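To make that concrete, here is a minimal sketch of the df.apply idea (assuming every object/text column should lose commas, quotes, and surrounding whitespace, without hard-coding any column labels):

# clean every text column: drop commas/quotes, then trim whitespace
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].apply(
    lambda s: s.str.replace(r'[,\"\']', '', regex=True).str.strip())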
