I have two files with 2 columns each, I need to use 1 column from one and one from another and create a new file with 2 columns.
while i < 500020:
    columns = datas.readline()
    columns2 = datas2.readline()
    columns = columns.split(" ")
    columns2 = columns2.split(" ")
    colum.write(" {1} {0}".format(columns2[1], columns[1]))
    i = i + 1
My output is like this:
181.053131
0.0005301
168.785828
0.3596852
I want them on the same line, e.g.:
181.053131 0.0005301
168.785828 0.3596852
You need to remove the newline from columns2[1]:
columns2 = datas2.readline().rstrip('\n')
otherwise you'll always insert those newlines into your output.
I'd also remove the newline from columns and use an explicit newline when writing:
columns = datas.readline().rstrip('\n')
and
colum.write(" {1} {0}\n".format(columns2[1], columns[1]))
I have this code, which I wrote to split the dataframe into two: one with all the unique columns and one with all the duplicate columns (columns that share a common name prefix).
# Function that splits the dataframe into two separate dataframes, one with all
# unique columns and one with all duplicates
def sub_dataframes(dataframe):
    # Extract common prefix -> remove trailing digits
    columns = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()
    # Split columns
    unq_cols = columns[columns == 1].index
    # All columns from the dataframe that are not in unq_cols
    dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]
    return dataframe[unq_cols], dataframe[dup_cols]

unq_df = sub_dataframes(df)[0]
dup_df = sub_dataframes(df)[1]

print("Unique columns:\n\n{}\n\nDuplicate columns:\n\n{}".format(
    unq_df.columns.tolist(), dup_df.columns.tolist()))
Output:
Unique columns:
['total_tracks', 'popularity']
Duplicate columns:
['t_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 't_energy1', 't_energy2',
't_key0', 't_key1', 't_key2', 't_speech0', 't_speech1', 't_speech2', 't_acous0', 't_acous1', 't_acous2',
't_ins0', 't_ins1', 't_ins2', 't_live0', 't_live1', 't_live2', 't_val0', 't_val1', 't_val2', 't_tempo0',
't_tempo1', 't_tempo2']
Then I tried to use wide_to_long to combine columns with the same name:
cols = unq_df.columns.tolist()
temp = (pd.wide_to_long(dataset.reset_index(),
                        stubnames=['t_dur', 't_dance', 't_energy', 't_key', 't_mode',
                                   't_speech', 't_acous', 't_ins', 't_live', 't_val',
                                   't_tempo'],
                        i=['index'] + cols, j='temp', sep='t_')
          .reset_index().groupby(cols, as_index=False).mean())
temp
This gave me a dataframe with "Nothing to show". I tried to look at this question, but it didn't help. What am I doing wrong here? How do I fix this?
EDIT
Here is an example of how I've done it "by hand", but I am trying to do it more efficiently using the built-in functions.
The desired output is the dataframe that is shown last.
We have the following dataframe:
# raw_df
print(raw_df.to_dict())
{'Edge': {1: '-1.9%-2.2%', 2: '+5.8%-9.4%', 3: '+3.5%-7.2%'}, 'Grade': {1: 'D+D', 2: 'BF', 3: 'B-F'}}
We are trying to split these 2 columns into 4 columns. The Edge column should split after the first %, and the Grade column should split before the 2nd capital letter appears. The output should look like:
output_df
edge_1 edge_2 grade_1 grade_2
-1.9% -2.2% D+ D
+5.8% -9.4% B F
+3.5% -7.2% B- F
We have raw_df[['t1_grade', 't2_grade']] = raw_df['Grade'].str.extractall(r'([A-Z])').unstack() to split the Grade column; however, the + and - are dropped here, which is a problem. We are also not sure how to split the Edge column after the first % appears.
We can use str.extract as follows:
df["edge_1"] = df["Edge"].str.extract(r'^([+-]?\d+(?:\.\d+)?%)')
df["edge_2"] = df["Edge"].str.extract(r'([+-]?\d+(?:\.\d+)?%)$')
df["grade_1"] = df["Grade"].str.extract(r'^([A-Z][+-]?)')
df["grade_2"] = df["Grade"].str.extract(r'([A-Z][+-]?)$')
The strategy here is to extract the first/last percentage/grade from the two current columns using regex.
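As a quick self-contained check (using the example raw_df from the question, here rebuilt as df):
import pandas as pd

df = pd.DataFrame({'Edge': {1: '-1.9%-2.2%', 2: '+5.8%-9.4%', 3: '+3.5%-7.2%'},
                   'Grade': {1: 'D+D', 2: 'BF', 3: 'B-F'}})
df["edge_1"] = df["Edge"].str.extract(r'^([+-]?\d+(?:\.\d+)?%)')
df["edge_2"] = df["Edge"].str.extract(r'([+-]?\d+(?:\.\d+)?%)$')
df["grade_1"] = df["Grade"].str.extract(r'^([A-Z][+-]?)')
df["grade_2"] = df["Grade"].str.extract(r'([A-Z][+-]?)$')
print(df[['edge_1', 'edge_2', 'grade_1', 'grade_2']])
# reproduces the desired table: -1.9% -2.2% D+ D, and so on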
Looks like you already have your solution, but here is another idea for splitting Edge without regex:
strip the trailing '%'
split by '%' with expand=True
add back '%'
df[['edge_1', 'edge_2']] = (
df['Edge'].str.rstrip('%').str.split('%', expand=True).add('%')
)
It's probably a silly thing, but I can't seem to correctly convert a pandas Series, originally read from an Excel sheet, to a list.
dfCI is created by importing data from an Excel sheet and looks like this:
tab var val
MsrData sortfield DetailID
MsrData strow 4
MsrData inputneeded "MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided","BiMonthlyTest"
# get list of cols for which input is needed
cols = dfCI[((dfCI['var'] == 'inputneeded') & (dfCI['tab'] == 'MsrData'))]['val'].values.tolist()
print(cols)
>> ['"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"']
# replace null text with text
invalid = 'Input Needed'
for col in cols:
    dfMSR[col] = np.where(dfMSR[col].isnull(), invalid, dfMSR[col])
However, the extra set of single quotes added when I converted cols from a Series to a list makes all the column names a single value, so that
col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
The desired output for cols is
cols = ["MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"]
What am I doing wrong?
Once you've got col, you can convert it to your expected output:
In [1109]: col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
In [1114]: cols = [i.strip() for i in col.replace('"', '').split(',')]
In [1115]: cols
Out[1115]: ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Another possible solution that comes to mind given the structure of cols is:
list(eval(cols[0])) # ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Although this is valid, it's less safe, and I would go with the list comprehension as @MayankPorwal suggested.
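A slightly safer variant of the eval idea is ast.literal_eval, which only accepts Python literals. A minimal sketch, assuming cols[0] always looks like the quoted, comma-separated string above:
from ast import literal_eval

col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
cols = list(literal_eval(col))  # parses the quoted names as a tuple of strings
# ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']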
I have a file full of URL paths like the one below, spread across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe, defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
# Objective: Replace "&FolderCTID", delete all string after
string1 = "&FolderCTID"
# Method 1
df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
# Method 2
df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
# Method 3
df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I searched on Google and found similar solutions, but none of them work.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows from columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it: first declare a variable with your target columns, then use stack() and str.split to get your target output, and finally unstack and reapply the output to your original df.
cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
If you want to replace these columns in your target df, then simply do:
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
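Alternatively, a minimal sketch that skips the stack/unstack reshape and applies the question's Method 1 to every column at once (assuming df and cols_to_slice as above):
# split each cell on string1 and keep only the part before it
df[cols_to_slice] = df[cols_to_slice].apply(lambda s: s.str.split(string1).str[0])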
You should first get the index of the string using
indexes = len(string1) + df_MasterData[i].str.find(string1)
# this points just past the end of string1, so string1 itself is kept in the result
# if you don't want string1 in the result, use this instead:
indexes = df_MasterData[i].str.find(string1)
Now do
# slice each string up to its own cut-off point (assumes string1 occurs in every row)
df_MasterData[i] = [s[:n] for s, n in zip(df_MasterData[i], indexes)]
I have a .txt file which has three columns in it.
id          ImplementationAuthority.email    AssignedEngineer.email
ALU02034116 bin.a.chen@shan.cn               bin.a.chen@ell.com.cn
ALU02035113                                  Guolin.Pan@ell.com.cn
ALU02034116 bin.a.chen@ming.com.cn           Guolin.Pan@ell.com.cn
ALU02022055 fria-sha-qdv@list.com
ALU02030797 fria-che-equipment-1@phoenix.com Balagopal.Velusamy@phoenix.com
I need to create two lists comprising the values under the columns ImplementationAuthority.email and AssignedEngineer.email. It works perfectly when the columns have complete values (i.e. no null values), but the values get mixed up when a column contains null values.
aengg = []
iauth = []
with open('test.txt') as f:
    for i, row in enumerate(f):
        columns = row.split()
        if len(columns) == 3:
            aengg.append(columns[2])
            iauth.append(columns[1])
print(aengg)
print(iauth)
I tried it with this code and it worked perfectly for complete column values.
Can anyone please tell me a solution for null values?
It seems you don't have a separator, so I use the number of spaces to decide which column a value belongs to, and fill the blank with None.
Try this:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
aengg = []
iauth = []
with open('C:\\temp\\test.txt') as f:
for i, row in enumerate(f):
columns = row.split()
if len(columns) == 2:
# when there are more than 17 spaces between two elements, I consider it as a third element in the row, then I add a None between them
if row.index(columns[1]) > 17:
columns.insert(1, None)
# if there are less than 17 spaces between two elements, I consider it as the second element in the row, then I add a None to the tail
else:
columns.append(None)
print columns
aengg.append(columns[2])
iauth.append(columns[1])
print aengg
print iauth
Here is the output.
['id', 'ImplementationAuthority.email', 'AssignedEngineer.email']
['ALU02034116', 'bin.a.chen@shan.cn', 'bin.a.chen@ell.com.cn']
['ALU02035113', None, 'Guolin.Pan@ell.com.cn']
['ALU02034116', 'bin.a.chen@ming.com.cn', 'Guolin.Pan@ell.com.cn']
['ALU02022055', 'fria-sha-qdv@list.com', None]
['ALU02030797', 'fria-che-equipment-1@phoenix.com', 'Balagopal.Velusamy@phoenix.com']
['AssignedEngineer.email', 'bin.a.chen@ell.com.cn', 'Guolin.Pan@ell.com.cn', 'Guolin.Pan@ell.com.cn', None, 'Balagopal.Velusamy@phoenix.com']
['ImplementationAuthority.email', 'bin.a.chen@shan.cn', None, 'bin.a.chen@ming.com.cn', 'fria-sha-qdv@list.com', 'fria-che-equipment-1@phoenix.com']
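For what it's worth, since the file is fixed-width, pandas can infer the column boundaries and fill the gaps with NaN on its own. A minimal sketch (read_fwf guesses the boundaries from the data; pass explicit colspecs if it misdetects them):
import pandas as pd

df = pd.read_fwf('test.txt')  # infers fixed-width column boundaries
iauth = df['ImplementationAuthority.email'].tolist()
aengg = df['AssignedEngineer.email'].tolist()
# missing cells come back as NaN, so the two lists stay aligned row by row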
You need to place a 'null' or 0 as a placeholder; otherwise the interpreter would read Guolin.Pan@ell.com.cn in the second row as the second column.
Try this:
id          ImplementationAuthority.email    AssignedEngineer.email
ALU02034116 bin.a.chen@shan.cn               bin.a.chen@ell.com.cn
ALU02035113 null                             Guolin.Pan@ell.com.cn
ALU02034116 bin.a.chen@ming.com.cn           Guolin.Pan@ell.com.cn
ALU02022055 fria-sha-qdv@list.com            null
ALU02030797 fria-che-equipment-1@phoenix.com Balagopal.Velusamy@phoenix.com
Then append the values after checking that they are not null:
aengg = []
iauth = []
with open('test.txt') as f:
    for i, row in enumerate(f):
        columns = row.split()
        if len(columns) == 3:
            if columns[2] != "null":
                aengg.append(columns[2])
            if columns[1] != "null":
                iauth.append(columns[1])
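With the placeholders in place, the same can also be done with pandas; a short sketch assuming the whitespace-delimited layout above (read_csv treats the string "null" as NaN by default):
import pandas as pd

df = pd.read_csv('test.txt', sep=r'\s+')  # "null" is parsed as NaN by default
aengg = df['AssignedEngineer.email'].dropna().tolist()
iauth = df['ImplementationAuthority.email'].dropna().tolist()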