I want to merge multiple files with different numbers of columns. It seems I can merge them, but I am not able to write the column names.
import glob
import pandas as pd

indir = "/home/centos/Data/MERGED/"
fileList = glob.glob(indir + "*.tsv")
dd = [pd.read_csv(f, sep="\t", header=0) for f in fileList]
result = pd.concat(dd, axis=1, join='inner', ignore_index=True, sort=False)
column_file = []
for f in fileList:
    tp = pd.read_csv(f, sep="\t", header=0)
    print(len(tp.columns.tolist()))
    column_file.append(",".join(tp.columns.tolist()))
344
119
177
304
502
178
36
80
478
375
502
166
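The likely culprit is ignore_index=True: with axis=1, concat replaces the column labels with the integers 0..n-1, so the original names are gone before anything is written out. A minimal sketch that keeps them, assuming the same fileList as above (the output filename merged_all.tsv is only a placeholder):
dd = [pd.read_csv(f, sep="\t", header=0) for f in fileList]
result = pd.concat(dd, axis=1, join='inner', sort=False)  # no ignore_index, so column names survive
result.to_csv(indir + "merged_all.tsv", sep="\t", index=False)  # placeholder output name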
I'm starting with a dataframe of baseball seasons, a section of which looks similar to this:
Name Season AB H SB playerid
13047 A.J. Pierzynski 2013 503 137 1 746
6891 A.J. Pierzynski 2006 509 150 1 746
1374 Rod Carew 1977 616 239 23 1001942
1422 Stan Musial 1948 611 230 7 1009405
1507 Todd Helton 2000 580 216 5 432
1508 Nomar Garciaparra 2000 529 197 5 190
1509 Ichiro Suzuki 2004 704 262 36 1101
From these seasons, I want to create a dataframe of career stats; that is, one row for each player which is a sum of their AB, H, etc. This dataframe should still include the names of the players. The playerid in the above is a unique key for each player and should either be an index or an unchanged value in a column after creating the career stats dataframe.
My hypothetical starting point is df_careers = df_seasons.groupby('playerid').agg(sum), but this leaves out all the non-numeric data. With numeric_only=False I get a mess in the name columns, like 'Ichiro SuzukiIchiro SuzukiIchiro Suzuki' from concatenation, which then requires a bunch of cleaning. This is something I'd like to be able to do with other data sets, and the actual data I have is more like 25 columns, so I'd rather understand a general routine for getting the Name data back (or preserving it from the outset) than write a specific function and use groupby('playerid').agg(func) (or a similar process) to do it, if possible.
I'm guessing there's a fairly simple way to do this, but I only started learning Pandas a week ago, so there are gaps in my knowledge.
You can write your own condition for how you want to handle the non-summed columns.
# Sum numeric columns; for object (string) columns, keep the first value.
df.groupby('playerid').agg({i: (lambda x: x.iloc[0] if x.dtypes == 'object' else x.sum()) for i in df.columns})
Result:
Name Season AB H SB playerid
playerid
190 Nomar Garciaparra 2000 529 197 5 190
432 Todd Helton 2000 580 216 5 432
746 A.J. Pierzynski 4019 1012 287 2 1492
1101 Ichiro Suzuki 2004 704 262 36 1101
1001942 Rod Carew 1977 616 239 23 1001942
1009405 Stan Musial 1948 611 230 7 1009405
If there is a one-to-one relationship between 'playerid' and 'Name', as appears to be the case, you can just include 'Name' in the groupby columns:
stat_cols = ['AB', 'H', 'SB']
groupby_cols = ['playerid', 'Name']
results = df.groupby(groupby_cols)[stat_cols].sum()
Results:
AB H SB
playerid Name
190 Nomar Garciaparra 529 197 5
432 Todd Helton 580 216 5
746 A.J. Pierzynski 1012 287 2
1101 Ichiro Suzuki 704 262 36
1001942 Rod Carew 616 239 23
1009405 Stan Musial 611 230 7
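If you'd rather have playerid and Name back as ordinary columns instead of a MultiIndex, a reset_index at the end does it:
results = df.groupby(groupby_cols)[stat_cols].sum().reset_index()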
If you'd prefer to group only by 'playerid' and add the 'Name' data back in afterwards, you can instead create a 'playerid' to 'Name' mapping as a dictionary and look it up using map:
results = df.groupby('playerid')[stat_cols].sum()
name_map = pd.Series(df.Name.to_numpy(), df.playerid).to_dict()
results['Name'] = results.index.map(name_map)
Results:
AB H SB Name
playerid
190 529 197 5 Nomar Garciaparra
432 580 216 5 Todd Helton
746 1012 287 2 A.J. Pierzynski
1101 704 262 36 Ichiro Suzuki
1001942 616 239 23 Rod Carew
1009405 611 230 7 Stan Musial
groupby.agg() can accept a dictionary that maps column names to functions. So one solution is to pass a dictionary to agg, specifying which functions to apply to each column.
Using the sample data above, one might use:
mapping = {'AB': sum, 'H': sum, 'SB': sum, 'Season': max, 'Name': max}
df_1 = df.groupby('playerid').agg(mapping)
The choice to use 'max' for those that shouldn't be summed is arbitrary. You could define a lambda function to apply to a column if you want to handle it in a certain way. DataFrameGroupBy.agg can work with any function that will work with DataFrame.apply.
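For instance, if you would rather keep the first Name seen per player than the max, a lambda slots straight into the same mapping (a small variation on the dictionary above):
mapping = {'AB': sum, 'H': sum, 'SB': sum, 'Season': max, 'Name': lambda s: s.iloc[0]}
df_1 = df.groupby('playerid').agg(mapping)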
To expand this to larger data sets, you might use a dictionary comprehension. This would work well:
# Build the mapping for every column except the grouping key, then override the exceptions.
dictionary = {x: sum for x in df.columns if x != 'playerid'}
dont_sum = {'Name': max, 'Season': max}
dictionary.update(dont_sum)
df_1 = df.groupby('playerid').agg(dictionary)
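A minor design note: passing the reducer names as strings rather than the Python builtins lets pandas use its own optimized implementations; the comprehension is otherwise identical:
dictionary = {x: 'sum' for x in df.columns if x != 'playerid'}
dictionary.update({'Name': 'max', 'Season': 'max'})
df_1 = df.groupby('playerid').agg(dictionary)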
I am looking for a way to aggregate over a large dataframe, possibly using groupby. Each group would be based on either pre-specified columns or regex, and the aggregation should produce a named output.
This produces a sample dataframe:
import pandas as pd
import itertools
import numpy as np
col = "A,B,C".split(',')
col1 = "1,2,3,4,5,6,7,8,9".split(',')
col2 = "E,F,G".split(',')
all_dims = [col, col1, col2]
all_keys = ['.'.join(i) for i in itertools.product(*all_dims)]
rng = pd.date_range(end=pd.Timestamp.today().date(), periods=12, freq='M')
df = pd.DataFrame(np.random.randint(0, 1000, size=(len(rng), len(all_keys))), columns=all_keys, index=rng)
The above produces a dataframe with one year's worth of monthly data and 36 columns with the following names:
['A.1.E', 'A.1.F', 'A.1.G', 'A.2.E', 'A.2.F', 'A.2.G', 'A.3.E', 'A.3.F',
'A.3.G', 'A.4.E', 'A.4.F', 'A.4.G', 'A.5.E', 'A.5.F', 'A.5.G', 'A.6.E',
'A.6.F', 'A.6.G', 'A.7.E', 'A.7.F', 'A.7.G', 'A.8.E', 'A.8.F', 'A.8.G',
'A.9.E', 'A.9.F', 'A.9.G', 'B.1.E', 'B.1.F', 'B.1.G', 'B.2.E', 'B.2.F',
'B.2.G', 'B.3.E', 'B.3.F', 'B.3.G', 'B.4.E', 'B.4.F', 'B.4.G', 'B.5.E',
'B.5.F', 'B.5.G', 'B.6.E', 'B.6.F', 'B.6.G', 'B.7.E', 'B.7.F', 'B.7.G',
'B.8.E', 'B.8.F', 'B.8.G', 'B.9.E', 'B.9.F', 'B.9.G', 'C.1.E', 'C.1.F',
'C.1.G', 'C.2.E', 'C.2.F', 'C.2.G', 'C.3.E', 'C.3.F', 'C.3.G', 'C.4.E',
'C.4.F', 'C.4.G', 'C.5.E', 'C.5.F', 'C.5.G', 'C.6.E', 'C.6.F', 'C.6.G',
'C.7.E', 'C.7.F', 'C.7.G', 'C.8.E', 'C.8.F', 'C.8.G', 'C.9.E', 'C.9.F',
'C.9.G']
What I would like now is to be able to aggregate over the dataframe, taking certain column combinations and producing named outputs. For example, one rule might be: take all 'A.*.E' columns (with any number in the middle), sum them, and produce a named output column called 'A.SUM.E'. And then do the same for 'A.*.F', 'A.*.G' and so on.
I have looked into pandas 0.25 named aggregation, which allows me to name my outputs, but I couldn't see how to simultaneously capture the right column combinations and produce the right output names.
If you need to reshape the dataframe to make a workable solution, that is fine as well.
Note, I am aware I could do something like this in a Python loop but I am looking for a pandas way to do it.
Not a groupby solution, and it uses a loop, but I think it's nonetheless rather elegant: first get a list of the unique (first, last) combinations of the column-name components using a set, then do the sums using filter:
cols = sorted(set((x.split('.')[0], x.split('.')[-1]) for x in df.columns))
for c0, c1 in cols:
    df[f'{c0}.SUM.{c1}'] = df.filter(regex=rf'{c0}\.\d+\.{c1}').sum(axis=1)
Result:
A.1.E A.1.F A.1.G A.2.E ... B.SUM.G C.SUM.E C.SUM.F C.SUM.G
2018-08-31 978 746 408 109 ... 4061 5413 4102 4908
2018-09-30 923 649 488 447 ... 5585 3634 3857 4228
2018-10-31 911 359 897 425 ... 5039 2961 5246 4126
2018-11-30 77 479 536 509 ... 4634 4325 2975 4249
2018-12-31 608 995 114 603 ... 5377 5277 4509 3499
2019-01-31 138 612 363 218 ... 4514 5088 4599 4835
2019-02-28 994 148 933 990 ... 3907 4310 3906 3552
2019-03-31 950 931 209 915 ... 4354 5877 4677 5557
2019-04-30 255 168 357 800 ... 5267 5200 3689 5001
2019-05-31 593 594 824 986 ... 4221 2108 4636 3606
2019-06-30 975 396 919 242 ... 3841 4787 4556 3141
2019-07-31 350 312 104 113 ... 4071 5073 4829 3717
If you want to have the result in a new DataFrame, just create an empty one and add the columns to it:
result = pd.DataFrame()
for c0, c1 in cols:
    result[f'{c0}.SUM.{c1}'] = df.filter(regex=rf'{c0}\.\d+\.{c1}').sum(axis=1)
Update: using a plain groupby (which is even simpler in this particular case):
def grouper(col):
    c = col.split('.')
    return f'{c[0]}.SUM.{c[-1]}'

df.groupby(grouper, axis=1).sum()
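On newer pandas versions, where axis=1 in groupby is deprecated, the same grouping can be done through a transpose; a quick sketch:
df.T.groupby(grouper).sum().T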
I have two data frames that I want to append together. Below are samples.
df_1:
Code Title
103 general checks
107 limits
421 horseshoe
319 scheduled
501 zonal
df_2:
Code Title
103 hello
108 lucky eight
421 little toe
319 scheduled cat
503 new item
I want to append df_2 to df_1 ONLY IF the code number in df_2 does not exist already in df_1.
Below is the dataframe I want:
Code Title
103 general checks
107 limits
421 horseshoe
319 scheduled
501 zonal
108 lucky eight
503 new item
I have searched through Google and Stackoverflow but couldn't find anything on this specific case.
Just append the filtered data frame:
df3 = df_2.loc[~df_2.Code.isin(df_1.Code)]
df_1.append(df3)
Code Title
0 103 general checks
1 107 limits
2 421 horseshoe
3 319 scheduled
4 501 zonal
1 108 lucky eight
4 503 new item
Notice that you might end up with duplicated indexes, which may cause problems. To avoid that, you can call .reset_index(drop=True) to get a fresh dataframe with no duplicated indexes.
df_1.append(df3).reset_index(drop=True)
Code Title
0 103 general checks
1 107 limits
2 421 horseshoe
3 319 scheduled
4 501 zonal
5 108 lucky eight
6 503 new item
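Note that DataFrame.append has since been deprecated (and removed in pandas 2.0); pd.concat gives the same result, with ignore_index=True standing in for the reset_index:
pd.concat([df_1, df3], ignore_index=True)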
You can concat and then drop_duplicates. This assumes Code is unique within each dataframe; since df_1 comes first in the concat, its rows are the ones kept when duplicates are dropped (keep='first' is the default).
res = pd.concat([df_1, df_2]).drop_duplicates('Code')
print(res)
Code Title
0 103 general checks
1 107 limits
2 421 horseshoe
3 319 scheduled
4 501 zonal
1 108 lucky eight
4 503 new item
Similar to concat(), you could also use merge with an outer join and then drop the duplicate Codes:
df3 = pd.merge(df_1, df_2, how='outer').drop_duplicates('Code')
Code Title
0 103 general checks
1 107 limits
2 421 horseshoe
3 319 scheduled
4 501 zonal
6 108 lucky eight
9 503 new item
I need to change the names of a subset of columns in a dataframe from whatever number they are to that number plus a string suffix. I know there is a function to add a suffix, but it doesn't seem to work on just a subset of the columns.
I create a list with all the relevant column names in it, then run a loop that, for each item in that list, renames the matching dataframe column to the same number plus the suffix string.
if scalename == "CDR":
    print(scaledf.columns.tolist())
    oldCols = scaledf.columns[7:].tolist()
    for f in range(len(oldCols)):
        changeCol = int(oldCols[f])
        print(changeCol)
        scaledf.rename(columns={changeCol: scalename + str(changeCol)})
    print(scaledf.columns)
This doesn't work.
The code prints out the column names and every item, but it does not rename the columns. It doesn't throw errors; it just doesn't work. I've tried variation after variation and gotten all kinds of other errors, but this error-free code does nothing. It just runs and doesn't rename anything.
Any help would be seriously appreciated! Thank you.
A sample of the list:
45
52
54
55
59
60
61
66
67
68
69
73
74
75
80
81
82
94
101
103
104
108
110
115
116
117
129
136
138
139
143
144
145
150
151
157
158
159
171
178
180
181
185
186
187
192
193
199
200
201
213
220
222
223
227
228
229
234
235
236
Try this. rename returns a new DataFrame rather than modifying the existing one in place, which is why the original loop ran without errors but changed nothing; assign the result back (or pass inplace=True):
scaledf = scaledf.rename(columns=lambda c: scalename + str(c) if c in oldCols else c)
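If you prefer an explicit mapping that by construction cannot touch columns outside oldCols, a dict comprehension does the same job:
scaledf = scaledf.rename(columns={c: scalename + str(c) for c in oldCols})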
I got all my data into an HDFStore (yeah!), but how do I get it out of it?
I've saved 6 DataFrames as frame_tables in my HDFStore. Each of these tables looks like the following, but the length varies (date is a Julian date).
>>> a = store.select('var1')
>>> a.head()
var1
x_coor y_coor date
928 310 2006257 133
932 400 2006257 236
939 311 2006257 253
941 312 2006257 152
942 283 2006257 68
Then I select from all my tables the values where the date is, e.g., > 2006256.
>>> b = store.select_as_multiple(['var1','var2','var3','var4','var5','var6'], where=(pd.Term('date','>',date)), selector= 'var1')
>>> b.head()
var1 var2 var3 var4 var5 var6
x_coor y_coor date
928 310 2006257 133 14987 7045 18 240 171
2006273 136 0 7327 30 253 161
2006289 125 0 -239 83 217 168
2006305 95 14604 6786 13 215 57
2006321 84 0 4548 13 133 88
This works, but only for relatively small .h5 files. For my normal .h5 files I would like to temporarily store the selection in an HDFStore using chunksize (since I also have to add a new column to it based on this selection). I tried this:
for df in store.select_as_multiple(['var1','var2','var3','var4','var5','var6'], where=(pd.Term('date','>',date)), selector='var1', chunksize=15):
    tempstore.put('test', pd.DataFrame(df))
But then only one chunk is added to the store. But with:
tempstore.append('test',pd.DataFrame(df))
I get ValueError: Can only append to Tables. What am I doing wrong?
When you tried to do this with put, it kept overwriting the node with the latest chunk; then you got the error when you appended, because you can't append to a storer (a non-table format).
That is:
put writes a single, non-appendable fixed format (called a storer), which is fast to write, but you cannot append to it or query it (you can only retrieve it in its entirety).
append creates a table format, which is what you want here (and what a frame_table is).
Note: you don't need to do pd.DataFrame(df) as df is already a frame.
So, first delete the node if it's already there:
if 'test' in tempstore:
    tempstore.remove('test')
Then append each DataFrame:
for df in store.select_as_multiple(.....):
    tempstore.append('test', df)
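Putting the pieces together, using the same chunked selection from the question:
if 'test' in tempstore:
    tempstore.remove('test')

for df in store.select_as_multiple(['var1', 'var2', 'var3', 'var4', 'var5', 'var6'],
                                   where=pd.Term('date', '>', date),
                                   selector='var1', chunksize=15):
    tempstore.append('test', df)  # builds an appendable, queryable table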