I have a text file with data that repeoates every 3 rows. Lets say it is hash, directory, sub directory. The data looks like the following:
a3s2d1f32a1sdf321asdf
Dir_321321
Dir2_asdf
s21a3s21d3f21as32d1f
Dir_65465
Dir2_werq
asd21231asdfa3s21d
Dir_76541
Dir2_wbzxc
....
I have created a python script that takes the data and every 3 rows creates columns:
import pandas as pd
df1 = pd.read_csv('RogTest/RogTest.txt', delimiter = "\t", header=None)
df2 = df1[df1.index % 3 == 0]
df2 = df2.reset_index(drop=True)
df3 = df1[df1.index % 3 == 1]
df3 = df3.reset_index(drop=True)
df4 = df1[df1.index % 3 == 2]
df4 = df4.reset_index(drop=True)
df5 = pd.concat([df2, df3], axis=1)
df6 = pd.concat([df5, df4], axis=1)
#Rename columns
df6.columns = ['Hash', 'Dir_1', 'Dir_2']
#Write to csv
df6.to_csv('RogTest/RogTest.csv', index=False, header=True)
This works fine but I am curious if there is a more efficient way to do this aka less code?
You can use:
df_final = pd.DataFrame(np.reshape(df.values,(3, df.shape[0]/3)))
df_final.columns = ['Hash', 'Dir_1', 'Dir_2']
Output:
Hash Dir_1 Dir_2
0 a3s2d1f32a1sdf321asdf Dir_321321 Dir2_asdf
1 s21a3s21d3f21as32d1f Dir_65465 Dir2_werq
2 asd21231asdfa3s21d Dir_76541 Dir2_wbzxc
Related
I'm trying to compare two excel spreadsheets, remove the names that appear in both spreadsheets from the first spreadsheet and then export it to a csv file using python. I am new, but here's what I have so far:
import pandas as pd
data_1 = pd.read_excel (r'names1.xlsx')
bit_data = pd.DataFrame(data_1, columns= ['Full_Name'])
bit_col = len(bit_data)
data_2 = pd.read_excel (r'force_names.xlsx')
force_data = pd.DataFrame(data_2, columns= ['FullName'])
force_col = len(force_data)
for bit_num in range(bit_col):
for force_num in range(force_col):
if bit_data.iloc[bit_num,0] == force_data.iloc[force_num,0]:
data_1 = data_1.drop(data_1.index[[bit_num]])
data_1.to_csv(r"/Users/name/Desktop/Reports/Names.csv")
When I run it it gets rid of some duplicates but not all, any advice anyone has would be greatly appreciated.
Use pandas merge to get all unique names, with no duplicates. If you want to drop any names that are in both files (I'm not sure if that's what you're asking), you can do so. See this toy example:
row1list = ['G. Anderson']
row2list = ['Z. Ebra']
df1 = pd.DataFrame([row1list, row2list], columns=['FullName'])
row1list = ['G. Anderson']
row2list = ['C. Obra']
df2 = pd.DataFrame([row1list, row2list], columns=['FullName'])
df3 = df1.merge(df2, on='FullName', how='outer', indicator=True)
print(df3)
# FullName _merge
# 0 G. Anderson both
# 1 Z. Ebra left_only
# 2 C. Obra right_only
df3 = df3.loc[df3['_merge'] != 'both']
del df3['_merge']
print(df3)
# FullName
# 1 Z. Ebra
# 2 C. Obra
I'm trying to subtract one data frame from another which all results should result in a 0 or blank based on the data in each my current excel files but will result in 0, 1, 2, or blank in the future. While some do result in a 0 or blank I'm also getting a -1 and 1. Any help that can be provided will be appreciated.
The two Excel sheets are identical except for number changes in second column.
Example
ExternalId TotalInteractions
name1 1
name2 2
name3 2
name4 1
Both sheets will look like the example and the output will look the same. I just need the difference between the two sheets
def GCList():
df1 = pd.read_excel('NewInter.xlsx')
df2 = pd.read_excel('PrevInter.xlsx')
df3 = df1['ExternalId']
df4 = df1['TotalInteractions']
df5 = df2['TotalInteractions']
df6 = df4.sub(df5)
frames = (df3, df6)
df = pd.concat(frames, axis = 1)
df.to_excel('GCList.xlsx')
GCList()
I managed to create a partial answer to getting the unexpected numbers. My problem now is that NewInter has more names than PrevInter does. Which results in a blank in TotalInteractions next to the new ExternalId. Any idea how to make it if it there is a blank to accept the value from NewInter?
def GCList():
df1 = pd.read_excel('NewInter.xlsx')
df2 = pd.read_excel('PrevInter.xlsx')
df3 = pd.merge(df1, df2, on = 'ExternalId', how = 'outer')
df4 = df3['TotalInteractions_x']
df5 = df3['TotalInteractions_y']
df6 = df3['ExternalId']
df7 = df4 - df5
frames = [df6,df7]
df = pd.concat(frames, axis = 1)
df.to_excel('GCList.xlsx')
GCList()
Figured out the issues. First part needed to be merged in order for the subtraction to work as the dataframes are not the same size. Also had to add in fill_value = 0 so it would take information from the new file.
def GCList():
df1 = pd.read_excel('NewInter.xlsx')
df2 = pd.read_excel('PrevInter.xlsx')
df3 = pd.merge(df1, df2, on = 'ExternalId', how = 'outer')
df4 = df3['TotalInteractions_x']
df5 = df3['TotalInteractions_y']
df6 = df3['ExternalId']
df7 = df4.sub(df5, fill_value = 0)
frames = [df6,df7]
df = pd.concat(frames, axis = 1)
df.to_excel('GCList.xlsx')
GCList()
I have this large data frame and I need to when certain resource are available for the first time. Let me explain it from my code.
df1 = df[df['Resource_ID'] == 1348]
df1 = df1[['Format', 'Range_Start', 'Number']]
df1["Range_Start"] = df1["Range_Start"].str[:7]
df1 = df1.groupby(['Format', 'Range_Start'], as_index=True).last()
pd.options.display.float_format = '{:,.0f}'.format
df1 = df1.unstack()
df1.columns = df1.columns.droplevel()
df2 = df1[1:4].sum(axis=0)
df2.name = 'sum'
df2 = df1.append(df2)
df3 = df2.T[['entry', 'sum']].copy()
df3.index = pd.to_datetime(df3.index)
Now print(df3.first('1D')) gives the following output:
Format entry sum
Range_Start
2011-07-01 97 72
I can now see that Resource_ID 1348 first occurs on 2011-07-01, how do I extract only the Year from this information?
This is my sample input csv data:
Access_Stat_ID,Resource_ID,Range_Start,Range_End,Name,Format,Number,Matched_URL
1,15,"2009-03-01 00:00:00","2009-03-31 23:59:59","Mar 2009","entry",3,""
203,13,"2009-04-01 00:00:00","2009-04-30 23:59:59","Apr 2009","entry",18,""
204,13,"2009-04-01 00:00:00","2009-04-30 23:59:59","Apr 2009","pdf",7,""
It seems need:
first_year = df3.index.year[0]
I am web-scraping tables from a website, and I am putting it to the Excel file.
My goal is to split a columns into 2 columns in the correct way.
The columns what i want to split: "FLIGHT"
I want this form:
First example: KL744 --> KL and 0744
Second example: BE1013 --> BE and 1013
So, I need to separete the FIRST 2 character (in the first column), and after that the next characters which are 1-2-3-4 characters. If 4 it's oke, i keep it, if 3, I want to put a 0 before it, if 2 : I want to put 00 before it (so my goal is to get 4 character/number in the second column.)
How Can I do this?
Here my relevant code, which is already contains a formatting code.
df2 = pd.DataFrame(datatable,columns = cols)
df2["UPLOAD_TIME"] = datetime.now()
mask = np.column_stack([df2[col].astype(str).str.contains(r"Scheduled", na=True) for col in df2])
df3 = df2.loc[~mask.any(axis=1)]
if os.path.isfile("output.csv"):
df1 = pd.read_csv("output.csv", sep=";")
df4 = pd.concat([df1,df3])
df4.to_csv("output.csv", index=False, sep=";")
else:
df3.to_csv
df3.to_csv("output.csv", index=False, sep=";")
Here the excel prt sc from my table:
You can use indexing with str with zfill:
df = pd.DataFrame({'FLIGHT':['KL744','BE1013']})
df['a'] = df['FLIGHT'].str[:2]
df['b'] = df['FLIGHT'].str[2:].str.zfill(4)
print (df)
FLIGHT a b
0 KL744 KL 0744
1 BE1013 BE 1013
I believe in your code need:
df2 = pd.DataFrame(datatable,columns = cols)
df2['a'] = df2['FLIGHT'].str[:2]
df2['b'] = df2['FLIGHT'].str[2:].str.zfill(4)
df2["UPLOAD_TIME"] = datetime.now()
...
...
Morning,
I have 3 excels that i have imported via from excel. I am trying to create a DataFrame which has taken the name ('Ticker') column from each import, add the title of the excel ('Secto') and append it to eachother to create a new DataFrame. This new DataFrame will then be exported to excel.
AA = ['Aero&Def','REITs', 'Auto&Parts']
File = 'FTSEASX_'+AA[0]+'_Price.xlsx'
xlsx = pd.ExcelFile('C:/Users/Ben/'+File)
df = pd.read_excel(xlsx, 'Price_Data')
df = df[df.Identifier.notnull()]
df.fillna(0)
a = []
b = []
for i in df['Ticker']:
a.append(i)
b.append(AA[0])
raw_data = {'Ticker': a, 'Sector': b}
df2 = pd.DataFrame(raw_data, columns = ['Ticker', 'Sector'])
del AA[0]
for j in AA:
File = 'FTSEASX_'+j+'_Price.xlsx'
xlsx = pd.ExcelFile('C:/Users/Ben/'+File)
df3 = pd.read_excel(xlsx, 'Price_Data')
df3 = df3[df3.Identifier.notnull()]
df3.fillna(0)
a = []
b = []
for i in df3['Ticker']:
a.append(i)
b.append(j)
raw_data = {'Ticker': a, 'Sector': b}
df4 = pd.DataFrame(raw_data, columns = ['Ticker', 'Sector'])
df5 = df2.append(df4)
I am currently getting the below but obviously the 2nd import, titled 'REITs' is not getting captured.
Ticker Sector
0 AVON-GB Aero&Def
1 BA-GB Aero&Def
2 COB-GB Aero&Def
3 MGGT-GB Aero&Def
4 SNR-GB Aero&Def
5 ULE-GB Aero&Def
6 QQ-GB Aero&Def
7 RR-GB Aero&Def
8 CHG-GB Aero&Def
0 GKN-GB Auto&Parts
How would i go about achieving this? or is there a better more pythonic way of achieving this?
I would do it this way:
import pandas as pd
AA = ['Aero&Def','REITs', 'Auto&Parts']
# assuming that ['Ticker','Sector','Identifier'] columns are in 'B,D,E' Excel columns
xl_cols='B,D,E'
dfs = [ pd.read_excel('FTSEASX_{0}_Price.xlsx'.format(f),
'Price_Data',
parse_cols=xl_cols,
).query('Identifier == Identifier')
for f in AA]
df = pd.concat(dfs, ignore_index=True)
print(df[['Ticker', 'Sector']])
Explanation:
.query('Identifier == Identifier') - gives you only those rows where Identifier is NOT NULL (using the fact that value == NaN will always be False)
PS You don't want to loop through your data frames when working with Pandas...