I have a list of CSV files which I load as data frames using pd.read_csv().
I am currently iterating through the list and using pd.concat() with axis=1 to join all the data frames by columns.
This works as hoped, but since all of the data frames share the same column names, the concatenated result has, for example, ten columns all named "Date".
Is there any way to give the columns unique names, e.g. London_Date, Berlin_Date, based on the names of the data frames?
If you pass a list of keys to concat(), you can then individually index any column you want with the given keys like so:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = df1
df3 = df1
add = pd.concat([df1, df2, df3], axis=1, keys=['Group_1', 'Group_2', 'Group_3'])
print(add.Group_1.A) # or add.Group_2.B etc...
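If you would rather have flat column names like London_Date than a MultiIndex, the two levels can be joined after the concat. A minimal sketch; the city frames below are made up for illustration:

```python
import pandas as pd

# Hypothetical frames with identical column names, keyed by city
london = pd.DataFrame({'Date': ['2024-01-01'], 'Temp': [5]})
berlin = pd.DataFrame({'Date': ['2024-01-01'], 'Temp': [1]})

combined = pd.concat([london, berlin], axis=1, keys=['London', 'Berlin'])

# Collapse the MultiIndex ('London', 'Date') -> 'London_Date'
combined.columns = ['_'.join(col) for col in combined.columns]
print(combined.columns.tolist())
# ['London_Date', 'London_Temp', 'Berlin_Date', 'Berlin_Temp']
```

The keys list would come from wherever you track the name of each loaded CSV.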
When my df2 Group2 has a third item 'B3', I get what I want from the groupby. How can I get that output when the arrays have different lengths?
I also struggle with getting all the data into the CSV, not just the last iteration. I tried creating the df before the loop and then merging within it, but something doesn't work.
import pandas as pd

df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'B1', 'B2', 'C13'],
                    'Whole': ['full', 'full', 'full', 'semi', 'semi', 'semi']})
df2 = pd.DataFrame({'Group1': ['A1', 'A2', 'A3'],
                    'Group2': ['B1', 'B2']})

for column in df2.columns:
    d_group = df1[df1.Title.isin(df2[column])]
    df = d_group.groupby('Whole')['Whole'].count()\
        .rename('Column Name from df2')\
        .reindex(['part', 'full', 'semi'], fill_value='-')\
        .reset_index()
    df.T.to_csv('all_groups2.csv', header=False, index=True)
    print(df.T)
Desired output:
Whole | part | full | semi
--------+---------+----------+----------
Group1 | - | 3 | -
Group2 | - | - | 2
A pandas DataFrame is expected to have columns (and rows) of the same length, so it is not possible to construct df2 as in your code.
I recommend using Series objects instead, something like this:
df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'B1', 'B2', 'C13'],
                    'Whole': ['full', 'full', 'full', 'semi', 'part', 'semi']})
group1 = pd.Series(['A1', 'A2', 'A3'])
group2 = pd.Series(['B1', 'B2'])
Then you can filter df1 with the isin function and group by Whole:
dfg1 = df1[df1['Title'].isin(group1)].groupby('Whole').count()
dfg2 = df1[df1['Title'].isin(group2)].groupby('Whole').count()
Finally, join them with concat on axis=1:
res = pd.concat([dfg1, dfg2], axis=1)
res.columns = ['Group1','Group2']
finaldf = res.T
The result is the following:
full part semi
Group1 3.0 NaN NaN
Group2 NaN 1.0 1.0
And finally, you can write it to a CSV with the same code that you had:
finaldf.to_csv('result.csv', header=False, index=True)
I recommend not writing row by row to a file unless the file is very large and you cannot hold it in memory; in that case, consider partitioning the data or using Dask.
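To match the desired output exactly (columns in the order part, full, semi, with '-' for missing counts), the transposed result can be reindexed and filled. A sketch that repeats the setup above so it runs on its own:

```python
import pandas as pd

# Rebuild finaldf as in the answer above
df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'B1', 'B2', 'C13'],
                    'Whole': ['full', 'full', 'full', 'semi', 'part', 'semi']})
group1 = pd.Series(['A1', 'A2', 'A3'])
group2 = pd.Series(['B1', 'B2'])

dfg1 = df1[df1['Title'].isin(group1)].groupby('Whole').count()
dfg2 = df1[df1['Title'].isin(group2)].groupby('Whole').count()
res = pd.concat([dfg1, dfg2], axis=1)
res.columns = ['Group1', 'Group2']
finaldf = res.T

# Enforce the column order from the desired output and
# replace NaN with the '-' placeholder
finaldf = finaldf.reindex(columns=['part', 'full', 'semi']).fillna('-')
print(finaldf)
```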
I just realized I could load my df2 as a pd.Series and iterate over the index, not the columns, to get where I wanted to be.
import pandas as pd

df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'C1', 'C2', 'C3'],
                    'ID': ['B1', 'B2', 'B3', 'A1', 'D2', 'D3'],
                    'Whole': ['full', 'full', 'full', 'semi', 'semi', 'semi']})
df2 = pd.Series({'Group1': ['A1', 'A2', 'A3'],
                 'Group2': ['B1', 'B2']})

rows = []
for index in df2.index:
    d_group = df1[df1.ID.isin(df2[index])]
    # rename(index) names the counted Series after the group;
    # rename(index, inplace=True) would return None and break the chain
    df3 = d_group.groupby('Whole')['Whole'].count()\
        .rename(index)\
        .reindex(['part', 'full', 'semi'], fill_value='-')
    rows.append(df3)

# DataFrame.append was removed in pandas 2.0; build the frame with concat
df = pd.concat(rows, axis=1).T
print(df)
I want to append lists of dataframes in an existing list of lists:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
fr_list = [[] for x in range(2)]
fr_list[0].append(df1)
fr_list[0].append(df1)
fr_list[1].append(df1)
fr2 = [[] for x in range(2)]
fr2[0].append(df1)
fr2[1].append(df1)
fr_list.append(fr2) # <-- here is the problem
Output: fr_list = [[df1, df1], [df1], [fr2[0], fr2[1]]] — the list contains 3 elements
Expected: fr_list = [[df1, df1, fr2[0]], [df1, fr2[1]]] — the list contains 2 elements
fr_list = [a + b for a, b in zip(fr_list, fr2)]
Replace fr_list.append(fr2) with the line above.
Explanation: zip pairs up the corresponding inner lists of fr_list and fr2, and the comprehension concatenates each pair. What you did was append fr2 itself as a single new element of the outer list, instead of extending the inner lists.
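A minimal, self-contained sketch of the difference, with plain integers standing in for the dataframes:

```python
fr_list = [[1, 1], [1]]
fr2 = [[2], [2]]

# Appending nests fr2 as one extra element of the outer list
appended = fr_list + [fr2]

# Element-wise concatenation keeps two inner lists
merged = [a + b for a, b in zip(fr_list, fr2)]

print(appended)  # [[1, 1], [1], [[2], [2]]]  -> 3 elements
print(merged)    # [[1, 1, 2], [1, 2]]        -> 2 elements
```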
I have two DataFrames like
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['j0', 'j1', 'j2'])
right = pd.DataFrame({'A': ['A1', 'A0', 'A2'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])
I want to column-bind them on the 'A' column. How can I achieve this?
And I want to do this with pd.concat, not pd.merge.
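One common approach, sketched here rather than taken from the original thread: concat aligns frames on their index, so moving 'A' into the index first lets an axis=1 concat do the column bind by key:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['j0', 'j1', 'j2'])
right = pd.DataFrame({'A': ['A1', 'A0', 'A2'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

# concat aligns on the index, so make 'A' the index of both frames
out = pd.concat([left.set_index('A'), right.set_index('A')], axis=1)
print(out)
```

This is an outer join on 'A'; keys present in only one frame would get NaN in the other frame's columns.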
When I merge two simple dataframes, then everything works fine. But when I apply the same code to my real dataframes, then the merging does not work correctly:
I want to merge df1 and df2 on column A using left joining.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5'],
                    'C': ['C0', 'C1', 'C2', 'C3', 'C4', 'C5'],
                    'D': ['D0', 'D1', 'D2', 'D3', 'D4', 'A5']})
result = pd.merge(df1, df2[["A","C"]], how='left', on='A')
In this case the result is correct (the number of rows in result is the same as df1).
However when I run the same code on my real data, the number of rows in result is much larger than df1 and is more similar to df2.
result = pd.merge(df1, df2[["ID","EVENT"]], how='left', on='ID')
The ID field is of type string (astype(str)).
What might be the reason for this? I cannot post the real dataset here, but maybe some hints can still be given based on my explanation. Thanks.
UPDATE:
I checked the result dataframe and I can see many duplicated rows with the same ID. Why?
See this slightly modified example (I modified the last two values in column A in df2):
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A0', 'A0'],
                    'C': ['C0', 'C1', 'C2', 'C3', 'C4', 'C5'],
                    'D': ['D0', 'D1', 'D2', 'D3', 'D4', 'A5']})
result = pd.merge(df1, df2[["A","C"]], how='left', on='A')
Output:
A B C
0 A0 B0 C0
1 A0 B0 C4
2 A0 B0 C5
3 A1 B1 C1
4 A2 B2 C2
5 A3 B3 C3
There is one A0 row for each A0 in df2. This is also what is happening with your data.
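If each key should match at most one row of df2, two common remedies, sketched with the toy frames above (the real column names would be ID and EVENT): drop duplicate keys before merging, or ask merge to validate the relationship.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A0', 'A0'],
                    'C': ['C0', 'C1', 'C2', 'C3', 'C4', 'C5']})

# Option 1: keep only the first row per key, so the left join
# cannot multiply rows of df1
dedup = df2.drop_duplicates(subset='A')
result = pd.merge(df1, dedup, how='left', on='A')
print(len(result))  # 4, same as df1

# Option 2: make merge fail fast if the right side has duplicate keys
try:
    pd.merge(df1, df2, how='left', on='A', validate='many_to_one')
except pd.errors.MergeError:
    print('df2 has duplicate keys in column A')
```

Which duplicate to keep (first, last, or an aggregate) depends on your data, so inspect the duplicated IDs before choosing.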
I have the following 2 simple dataframes.
df1:
df2:
I want to add df2 to df1 by using something like:
df1["CF 0.3"]=df2
However, this only adds values where an index exists in both df1 and df2. I would like the missing indexes to be added automatically, with NaN wherever an index has no associated value. Something like this:
The way I did this was by writing
df1 = df1.add(df2)
This automatically adds the missing indexes, but all the values become NaN. Then I manually repopulated the values:
df1["CF 0.1"] = dummyDF1
df1["CF 0.3"] = dummyDF2
Is there an easier way to do this? I have a feeling I am missing something.
I hope you understand my question :)
Use concat; refer to this documentation for detailed help.
Here is an example based on the documentation:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'X': ['A4', 'A5', 'A6', 'A7'],
                    'XB': ['B4', 'B5', 'B6', 'B7'],
                    'XC': ['C4', 'C5', 'C6', 'C7'],
                    'XD': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df3 = pd.DataFrame({'YA': ['A8', 'A9', 'A10', 'A11'],
                    'YB': ['B8', 'B9', 'B10', 'B11'],
                    'YC': ['C8', 'C9', 'C10', 'C11'],
                    'YD': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])

# To get the result you are looking for, you need to reset the index.
# With these dataframes you may not be able to use merge,
# since merge needs a common index or column.
frames = [df1.reset_index(drop=True), df2.reset_index(drop=True), df3.reset_index(drop=True)]
df4 = pd.concat(frames, axis=1)
print(df4)
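For the original question of adding a column whose index only partially overlaps, a sketch of the index-aligned route: an axis=1 concat performs an outer join on the index, adding the missing index labels and filling NaN where a frame has no value. The frame contents below are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'CF 0.1': [1.0, 2.0, 3.0]}, index=[0, 1, 2])
df2 = pd.Series([9.0, 8.0], index=[2, 3], name='CF 0.3')

# axis=1 concat outer-joins the indexes: label 3 is added with NaN
# in df1's column, labels 0 and 1 get NaN in df2's column
out = pd.concat([df1, df2], axis=1)
print(out)
#    CF 0.1  CF 0.3
# 0     1.0     NaN
# 1     2.0     NaN
# 2     3.0     9.0
# 3     NaN     8.0
```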
Please read the docs and use concat, merge, or join:
http://pandas.pydata.org/pandas-docs/stable/merging.html#brief-primer-on-merge-methods-relational-algebra
Have a look at the concat function, which does what you are looking for here.