I have a dataframe that looks like this:
contactId ticker
0 ABC XYZ
1 ABC ZZZ
2 BCA YYY
Creating a pivot like so:
final_df = final_df.pivot_table(index='contactId', columns='ticker', aggfunc=len, fill_value=0)
Results in the following output:
ticker XYZ ZZZ YYY
contactId
ABC 1 1 0
BCA 0 0 1
As an intermediary step (see request below), I am assuming we need to transform the pivot so that if the value > 0 we show the ticker, else leave it blank, i.e.:
ticker XYZ ZZZ YYY
contactId
ABC XYZ ZZZ
BCA YYY
Because the output I am looking for is a list of space-separated tickers plus a text string per contactId:
contactId ticker description
ABC XYZ ZZZ The client is holding: XYZ ZZZ
BCA YYY The client is holding: YYY
For the intermediary step I tried the following (but it threw ValueError: Grouper for 'ticker' not 1-dimensional):
final_df = final_df.pivot_table(index='contactId', columns='ticker', values='ticker', fill_value="")
Can you please assist? Thank you for the help in advance!
Inspired by @sharatpc's suggestion, after adding the line below to drop rows with a null contactId:
df = df[pd.notnull(df['contactId'])]
This worked for me:
df = df.set_index('contactId').groupby('contactId')['ticker'].transform(lambda x: ' '.join(x)).reset_index()
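For completeness, a minimal sketch (assuming the original two-column frame shown at the top) that skips the pivot and the intermediary step entirely; groupby plus str.join aggregates the tickers per contactId, and the description is built from the joined string:
import pandas as pd

df = pd.DataFrame({'contactId': ['ABC', 'ABC', 'BCA'],
                   'ticker': ['XYZ', 'ZZZ', 'YYY']})

# one row per contactId, tickers joined by a single space
out = df.groupby('contactId')['ticker'].agg(' '.join).reset_index()
out['description'] = 'The client is holding: ' + out['ticker']
print(out)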
I have a data frame and I am trying to map one column's values to the values present in a set.
Data frame is
Name CallType Location
ABC IN SFO
DEF OUT LHR
PQR INCOMING AMS
XYZ OUTGOING BOM
TYR A_IN DEL
OMN A_OUT DXB
I have a constant set of call types; each CallType value should be replaced by its match from the set:
call_type = {"IN", "OUT"}
Desired data frame
Name CallType Location
ABC IN SFO
DEF OUT LHR
PQR IN AMS
XYZ OUT BOM
TYR IN DEL
OMN OUT DXB
I wrote the code below to check the result, but process.extractOne sometimes gives IN for OUTGOING (which is wrong) and sometimes gives OUT for OUTGOING (which is right).
Here is my code:
import pandas as pd
from fuzzywuzzy import process

data = [('ABC', 'IN', 'SFO'),
        ('DEF', 'OUT', 'LHR'),
        ('PQR', 'INCOMING', 'AMS'),
        ('XYZ', 'OUTGOING', 'BOM'),
        ('TYR', 'A_IN', 'DEL'),
        ('OMN', 'A_OUT', 'DXB')]
df = pd.DataFrame(data, columns=['Name', 'CallType', 'Location'])
call_types = set(['IN', 'OUT'])
df['CallType'] = df['CallType'].apply(lambda x: process.extractOne(x, list(call_types))[0])
total_rows = len(df)
for row_no in range(total_rows):
    row = df.iloc[row_no]
    print(row)  # sometimes OUTGOING is set to OUT and sometimes to IN; shouldn't the result be consistent?
I am not sure if there is a better way. Can someone please suggest if I am missing something.
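(A note on the inconsistency itself: list(call_types) is built from a set, whose iteration order is not stable across runs, and the fuzzy scores of IN and OUT against OUTGOING can tie, so extractOne returns whichever choice happens to come first. A sketch of a deterministic variant, still using fuzzywuzzy:
choices = sorted(call_types)  # fixed order, so ties always resolve the same way
df['CallType'] = df['CallType'].apply(lambda x: process.extractOne(x, choices)[0])
This makes the output stable across runs, though the answers below avoid the tie altogether.)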
Looks like Series.str.extract is a good fit for this:
df['CallType'] = df.CallType.str.extract(r'(OUT|IN)')
print(df)
Name CallType Location
0 ABC IN SFO
1 DEF OUT LHR
2 PQR IN AMS
3 XYZ OUT BOM
4 TYR IN DEL
5 OMN OUT DXB
Or, if you want to use call_types explicitly, you can do:
df['CallType'] = df.CallType.str.extract(fr"({'|'.join(call_types)})")
# same result
A possible solution is to use difflib.get_close_matches (a lowered cutoff is needed here, because the default similarity threshold of 0.6 rejects OUT as a match for OUTGOING):
import difflib
df['CallType'] = df['CallType'].apply(
    lambda x: difflib.get_close_matches(x, call_type, n=1, cutoff=0.4)[0])
Output:
Name CallType Location
0 ABC IN SFO
1 DEF OUT LHR
2 PQR IN AMS
3 XYZ OUT BOM
4 TYR IN DEL
5 OMN OUT DXB
Another possible solution:
import numpy as np

df['CallType'] = np.where(df['CallType'].str.contains('OUT'), 'OUT', 'IN')
Output:
# same
I need to automate the validations performed on text files. I have two text files; if a row in file one has a combination of two columns that also appears in file two, then the extra column from file two needs to be written into file one.
Text file 1 has thousands of records, and text file 2 serves as the reference for text file 1.
So far I have written the following code. Please help me solve this.
import pandas as pd

df = pd.read_csv("C:\\Users\\hp\\Desktop\\py\\sample2.txt", delimiter=',')
print(df)
# uniquecal = df[['vehicle_Brought_City', 'Vehicle_Brand']]
# print(uniquecal)
df1 = pd.read_csv("C:\\Users\\hp\\Desktop\\py\\sample1.txt", delimiter=',')
print(df1)
# uniquecal1 = df1[['vehicle_Brought_City', 'Vehicle_Brand']]
# print(uniquecal1)
How can I put the vehicle price into dataframe one and save it back to text file 1?
Below is my sample dataset:
File1:
fname lname vehicle_Brought_City Vehicle_Brand Vehicle_price
0 aaa xxx pune honda NaN
1 aaa yyy mumbai tvs NaN
2 aaa xxx hyd maruti NaN
3 bbb xxx pune honda NaN
4 bbb aaa mumbai tvs NaN
File2:
vehicle_Brought_City Vehicle_Brand Vehicle_price
0 pune honda 50000
1 mumbai tvs 40000
2 hyd maruti 45000
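You can merge the two frames on the two key columns; dropping the all-NaN price column from File1 first keeps the result from carrying duplicate price columns: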
del df['Vehicle_price']
print(df)
dd = pd.merge(df, df1, on=['vehicle_Brought_City', 'Vehicle_Brand'])
print(dd)
output:
fname lname vehicle_Brought_City Vehicle_Brand Vehicle_price
0 aaa xxx pune honda 50000
1 aaa yyy mumbai tvs 40000
2 bbb aaa mumbai tvs 40000
3 aaa xxx hyd maruti 45000
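To finish the task and save the result back to text file 1, to_csv works; the path here is just the sample path from the question (sample2.txt being File1 in the code above). Note that the default inner join drops File1 rows whose city/brand pair is absent from File2; pass how='left' to pd.merge if those rows should be kept with NaN prices.
dd.to_csv("C:\\Users\\hp\\Desktop\\py\\sample2.txt", index=False)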
I have a Dataframe:
ID Name Salary($)
0 Alex Jones 44,000
1 Bob Smith 65,000
2 Peter Clarke 50,000
In order to protect the privacy of the individuals in this dataset, I want to mask the output of this Dataframe in a Jupyter notebook like this:
ID Name Salary($)
0 AXXX XXXX 44,000
1 BXX XXXXX 65,000
2 PXXXX XXXXX 50,000
Individually replacing characters in each name seems very crude to me. There must be a better approach?
You can concatenate the first character with the result of replacing all characters in the remaining slice of each name using str.replace:
In[16]:
df['Name'] = df['Name'].str[0] + df['Name'].str[1:].str.replace(r'\w', 'X', regex=True)
df
Out[16]:
ID Name Salary($)
0 0 AXXX XXXXX 44,000
1 1 BXX XXXXX 65,000
2 2 PXXXX XXXXXX 50,000
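An equivalent single regex pass over the full name (a sketch; the negative lookbehind leaves only the very first character unmasked, matching the output above):
df['Name'] = df['Name'].str.replace(r'(?<!^)\w', 'X', regex=True)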
The SQL operation is as below:
UPDATE table_A t, table_B s SET t.stat_fips = s.stat_fips
WHERE t.stat_code = s.stat_code;
If a similar operation needs to be done on CSV A, comparing values from CSV B, how can this be achieved in Python?
Data:
Let's assume:
CSV A
col1 stat_code name
abc WY ABC
def NA DEF
ghi AZ GHI
CSV B
stat_fips stat_code
2234 WY
4344 NA
4588 AZ
Resulting CSV :
col1 stat_code name stat_fips
abc WY ABC 2234
def NA DEF 4344
ghi AZ GHI 4588
Adding the attempted code so far :
df = pd.read_csv('fin.csv', sep='\t', quotechar="'")
df = df.set_index('col1').stack(dropna=False).reset_index()
df1['stat_fips'] = df1['stat_code']
print(df1)
(Not really sure about pandas; still learning the basics.)
Judging by your example data, this looks like a merge operation on your stat_code column:
import pandas as pd
df_a = pd.DataFrame([["abc", "WY", "ABC"], ["def", "NA", "DEF"]], columns= ["col1", "stat_code", "name"])
df_b = pd.DataFrame([[2234, "WY"], [4344, "NA"]], columns=["stat_fips", "stat_code"])
merged_df = pd.merge(df_a, df_b, on="stat_code", how="left")
print(merged_df)
col1 stat_code name stat_fips
0 abc WY ABC 2234
1 def NA DEF 4344
It seems you need to map using the dict d:
d = df2.set_index('stat_code')['stat_fips'].to_dict()
df1['stat_fips'] = df1['stat_code'].map(d)
print (df1)
col1 stat_code name stat_fips
0 abc WY ABC 2234
1 def NaN DEF 4344
2 ghi AZ GHI 4588
Or merge with left join:
df3 = pd.merge(df1, df2, on='stat_code', how='left')
print (df3)
col1 stat_code name stat_fips
0 abc WY ABC 2234
1 def NaN DEF 4344
2 ghi AZ GHI 4588
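One caveat: read_csv treats the literal string NA as a missing value by default, which is why stat_code shows NaN in the outputs above. If NA is a real state code in your data, pass keep_default_na=False when reading (the file names below are placeholders):
df1 = pd.read_csv('csv_a.csv', keep_default_na=False)
df2 = pd.read_csv('csv_b.csv', keep_default_na=False)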
I'm trying to remove spaces, apostrophes, and double quotes from each column's data using this for loop:
for c in data.columns:
data[c] = data[c].str.strip().replace(',', '').replace('\'', '').replace('\"', '').strip()
but I keep getting this error:
AttributeError: 'Series' object has no attribute 'strip'
data is the data frame and was obtained from an excel file
xl = pd.ExcelFile('test.xlsx');
data = xl.parse(sheetname='Sheet1')
Am I missing something? I added the str but that didn't help. Is there a better way to do this?
I don't want to use the column labels, like so data['column label'], because the text can be different. I would like to iterate each column and remove the characters mentioned above.
incoming data:
id city country
1 Ontario Canada
2 Calgary ' Canada'
3 'Vancouver Canada
desired output:
id city country
1 Ontario Canada
2 Calgary Canada
3 Vancouver Canada
UPDATE: using your sample DF:
In [80]: df
Out[80]:
id city country
0 1 Ontario Canada
1 2 Calgary ' Canada'
2 3 'Vancouver Canada
In [81]: df.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
Out[81]:
id city country
0 1 Ontario Canada
1 2 Calgary Canada
2 3 Vancouver Canada
OLD answer:
you can use DataFrame.replace() method:
In [75]: df.to_dict('r')
Out[75]:
[{'a': ' x,y ', 'b': 'a"b"c', 'c': 'zzz'},
{'a': "x'y'z", 'b': 'zzz', 'c': ' ,s,,'}]
In [76]: df
Out[76]:
a b c
0 x,y a"b"c zzz
1 x'y'z zzz ,s,,
In [77]: df.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
Out[77]:
a b c
0 xy abc zzz
1 xyz zzz s
r'\1' is a backreference to the numbered capturing RegEx group
data[c] does not return a single value; it returns a Series (a whole column of data), and a Series has no strip method.
You can apply the string operations element-wise instead, either via df.apply or by chaining the .str accessor, as in the sketch below.
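A minimal sketch of the corrected loop (assuming every column should be treated as text; astype(str) will also stringify numeric columns such as id):
for c in data.columns:
    data[c] = (data[c].astype(str)                             # work on strings
                      .str.strip()                             # trim surrounding whitespace
                      .str.replace(r'[,\'"]', '', regex=True)  # drop commas, apostrophes, double quotes
                      .str.strip())                            # trim again once the quotes are gone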