Merge two CSVs with unique columns in Python

I have two CSV files representing data from two different years. I know how to do the basic merging using csv.writer and dict keys, but the problem lies here: while the CSVs have mostly shared column headers, each may have unique columns. If a species was caught in one year but not the other, that column would only be present in that year. How can I merge the new data into the old data, creating new columns and padding the old data with zeros in those columns?
File 1: "Date","Time","Species A","Species B","Species X"
File 2: "Date","Time","Species A","Species B","Species C"
I need the end result to be one CSV with this header:
"Date","Time","Species A","Species B","Species C","Species X"

Someone else will probably post a solution using the csv module, so I'll give a pandas solution for comparison purposes:
import pandas as pd
df1 = pd.read_csv("fish1.csv")
df2 = pd.read_csv("fish2.csv")
df = pd.concat([df1, df2]).fillna(0)
# Reorder so Date and Time come first; concat here has sorted the union of
# columns alphabetically, leaving Date first and Time last.
df = df[["Date", "Time"] + list(df.columns[1:-1])]
df.to_csv("merged_fish.csv", index=False)
Explanation:
First, we read in the two files:
>>> df1 = pd.read_csv("fish1.csv")
>>> df2 = pd.read_csv("fish2.csv")
>>> df1
   Date  Time  Species A  Species B  Species X
0     1     2          3          4          5
1     6     7          8          9         10
2    11    12         13         14         15
>>> df2
   Date  Time  Species A  Species B  Species C
0    16    17         18         19         20
1    21    22         23         24         25
2    26    27         28         29         30
Then we simply concatenate them, which automatically fills the missing data with NaN:
>>> df = pd.concat([df1, df2])
>>> df
   Date  Species A  Species B  Species C  Species X  Time
0     1          3          4        NaN          5     2
1     6          8          9        NaN         10     7
2    11         13         14        NaN         15    12
0    16         18         19         20        NaN    17
1    21         23         24         25        NaN    22
2    26         28         29         30        NaN    27
You want them filled with 0 instead, so:
>>> df = pd.concat([df1, df2]).fillna(0)
>>> df
   Date  Species A  Species B  Species C  Species X  Time
0     1          3          4          0          5     2
1     6          8          9          0         10     7
2    11         13         14          0         15    12
0    16         18         19         20          0    17
1    21         23         24         25          0    22
2    26         28         29         30          0    27
This order isn't quite the one you asked for, though; you wanted Date and Time first, so:
>>> df = df[["Date", "Time"] + list(df.columns[1:-1])]
>>> df
   Date  Time  Species A  Species B  Species C  Species X
0     1     2          3          4          0          5
1     6     7          8          9          0         10
2    11    12         13         14          0         15
0    16    17         18         19         20          0
1    21    22         23         24         25          0
2    26    27         28         29         30          0
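A caveat: the column order after concat depends on your pandas version. Older versions sorted the union of columns alphabetically (as shown above); since pandas 1.0, concat keeps columns in order of appearance unless you pass sort=True, so df.columns[1:-1] may no longer isolate the species columns. An explicit selection is version-proof:
df = df[["Date", "Time"] + sorted(c for c in df.columns if c not in ("Date", "Time"))]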
And then we save it as a CSV file:
>>> df.to_csv("merged_fish.csv", index=False)
producing
Date,Time,Species A,Species B,Species C,Species X
1,2,3,4,0.0,5.0
6,7,8,9,0.0,10.0
11,12,13,14,0.0,15.0
16,17,18,19,20.0,0.0
21,22,23,24,25.0,0.0
26,27,28,29,30.0,0.0

Here's a csv module solution in Python 3:
import csv
# Generate some data...
csv1 = '''\
Date,Time,Species A,Species B,Species C
04/01/2012,13:00,1,2,3
04/02/2012,13:00,1,2,3
04/03/2012,13:00,1,2,3
04/04/2012,13:00,1,2,3
'''
csv2 = '''\
Date,Time,Species A,Species B,Species X
04/01/2013,13:00,1,2,3
04/02/2013,13:00,1,2,3
04/03/2013,13:00,1,2,3
04/04/2013,13:00,1,2,3
'''
with open('2012.csv', 'w') as f:
    f.write(csv1)
with open('2013.csv', 'w') as f:
    f.write(csv2)
# The actual program
years = ['2012.csv','2013.csv']
lines = []
headers = set()
for year in years:
    with open(year, 'r', newline='') as f:
        r = csv.DictReader(f)
        lines.extend(list(r))                  # Merge lines from all files.
        headers = headers.union(r.fieldnames)  # Collect unique column names.
# Sort the unique headers, keeping the Date and Time columns first.
new_headers = ['Date', 'Time'] + sorted(headers - {'Date', 'Time'})
with open('result.csv', 'w', newline='') as f:
    # restval is the default written when a row is missing a key.
    w = csv.DictWriter(f, new_headers, restval=0)
    w.writeheader()
    w.writerows(lines)
# View the result
with open('result.csv') as f:
    print(f.read())
Output:
Date,Time,Species A,Species B,Species C,Species X
04/01/2012,13:00,1,2,3,0
04/02/2012,13:00,1,2,3,0
04/03/2012,13:00,1,2,3,0
04/04/2012,13:00,1,2,3,0
04/01/2013,13:00,1,2,0,3
04/02/2013,13:00,1,2,0,3
04/03/2013,13:00,1,2,0,3
04/04/2013,13:00,1,2,0,3

According to the docs, it looks like you should be able to read both files with csv.DictReader, merge the keys of the two fieldname lists, then use the fieldnames and restval parameters on csv.DictWriter to get your 0 defaults.
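A minimal sketch of that approach, reusing the fish1.csv/fish2.csv filenames from the pandas answer above:
import csv

rows, fields = [], []
for path in ('fish1.csv', 'fish2.csv'):
    with open(path, newline='') as f:
        r = csv.DictReader(f)
        rows.extend(r)
        # Merge the header lists, keeping first-seen order.
        fields.extend(c for c in r.fieldnames if c not in fields)

with open('merged_fish.csv', 'w', newline='') as f:
    # restval=0 supplies the default for columns a row is missing.
    w = csv.DictWriter(f, fieldnames=fields, restval=0)
    w.writeheader()
    w.writerows(rows)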

Related

Rewriting column cell values in a dataframe based on when the value changes, without using if statements

I have a column with faulty values: it is supposed to count cycles, but the device the data comes from resets the count after 50, so I was left with, for example, [1,1,1,1,2,2,2,3,3,3,3,...,50,50,50,1,1,1,2,2,2,2,3,3,3,...,50,50,...,50]
Here is my attempt, which I can't even get to work (for simplicity I made the data reset after 10 cycles):
import pandas as pd

data = {'Cyc-Count': [1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,1,1,1,2,3,3,3,3,
                      4,4,5,6,6,6,7,8,8,8,8,9,10]}
df = pd.DataFrame(data)
x = 0
count = 0
old_value = df.at[x, 'Cyc-Count']
for x in range(x, len(df) - 1):
    if df.at[x, 'Cyc-Count'] == df.at[x + 1, 'Cyc-Count']:
        old_value = df.at[x + 1, 'Cyc-Count']
        df.at[x + 1, 'Cyc-Count'] = count
    else:
        old_value = df.at[x + 1, 'Cyc-Count']
        count += 1
        df.at[x + 1, 'Cyc-Count'] = count
I need to fix this, preferably without even using if statements.
The desired output for the example above should be:
data = {'Cyc-Count': [1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,11,11,11,12,13,13,13,13,
                      14,14,15,16,16,16,17,18,18,18,18,19,20]}
Hint: my method has a big issue in that the last indexed value is hard to change, since when comparing it with index+1, that index doesn't even exist.
IIUC, you want the count to keep increasing across resets (i.e., whenever the counter decreases).
You can use vectorized code:
s = df['Cyc-Count'].shift()
df['Cyc-Count2'] = (df['Cyc-Count']
                    + s.where(s.gt(df['Cyc-Count']))
                       .fillna(0, downcast='infer')
                       .cumsum()
                    )
Or, to modify the column in place:
s = df['Cyc-Count'].shift()
df['Cyc-Count'] += (s.where(s.gt(df['Cyc-Count']))
                    .fillna(0, downcast='infer')
                    .cumsum()
                    )
output:
    Cyc-Count  Cyc-Count2
0           1           1
1           1           1
2           1           1
3           1           1
4           2           2
5           2           2
6           2           2
7           3           3
8           3           3
9           3           3
10          3           3
11          4           4
12          5           5
13          5           5
14          5           5
15          1           6
16          1           6
17          1           6
18          2           7
19          2           7
20          2           7
21          2           7
22          3           8
23          3           8
24          3           8
25          4           9
26          5          10
27          5          10
28          1          11
29          2          12
30          2          12
31          3          13
32          4          14
33          5          15
34          5          15
used input:
l = [1,1,1,1,2,2,2,3,3,3,3,4,5,5,5,1,1,1,2,2,2,2,3,3,3,4,5,5,1,2,2,3,4,5,5]
df = pd.DataFrame({'Cyc-Count': l})
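Note that the downcast='infer' argument to fillna is deprecated in recent pandas (2.1+); an equivalent without it, casting explicitly, might look like:
s = df['Cyc-Count'].shift()
offset = s.where(s.gt(df['Cyc-Count'])).fillna(0).cumsum().astype(int)
df['Cyc-Count2'] = df['Cyc-Count'] + offset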
You can use df.loc to access a group of rows and columns by label(s) or a boolean array.
syntax: df.loc[df['column name'] condition, 'column name or the new one'] = 'value if condition is met'
for example:
import pandas as pd
numbers = {'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]}
df = pd.DataFrame(numbers,columns=['set_of_numbers'])
print (df)
df.loc[df['set_of_numbers'] == 0, 'set_of_numbers'] = 999
df.loc[df['set_of_numbers'] == 5, 'set_of_numbers'] = 555
print (df)
Before: 'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]
After: 'set_of_numbers': [1,2,3,4,555,6,7,8,9,10,999,999]

ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements

This is my code that I plan to use for creating a pie chart.
import csv
with open('C:\\Users\Bhuwan Bhatt\Desktop\IP PROJECT\Book1.csv', 'r') as file:
    reader = csv.reader(file)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def piechart1():
    df = pd.read_csv('data,csv', sep=' ', index_col=False, skipinitialspace=True,
                     error_bad_lines=False, encoding='unicode_escape')
    df = df.set_index(['Country'])
    dfl = df.iloc[:, [14]]
    final_df = dfl.sort_values(by='TotalMedal')
    final_df.reset_index(inplace=True)
    final_df.columns = ('location', 'Total cases', 'Total Deaths')
    final_df = final_df.drop(11, axis='index')
    countries = df['Country']
    tmedals = df['TotalMedal']
    plt.pie(tmedals, labels=countries, explode=(0.1,0,0,0,0,0,0,0,0,0,0.2),
            shadow=True, autopct='%0.1f%%')
    plt.title("Olympics data analysis\nTop 10 Countries", color='b', fontsize=12)
    plt.gcf().canva.set_window_title("OLMPICS ANALYSIS")
    plt.show()
I get this error for some reason:
AttributeError: 'DataFrameGroupBy' object has no attribute 'sort_values'
This is the CSV file I've been using:
Country SummerTimesPart Sumgoldmedal Sumsilvermedal Sumbronzemedal SummerTotal WinterTimesPart Wingoldmedal Winsilvermedal Winbronzemedal WinterTotal TotalTimesPart Tgoldmedal Tsilvermedal Tbronzemedal TotalMedal
 Afghanistan  14 0 0 2 2 0 0 0 0 0 14 0 0 2 2
 Algeria  13 5 4 8 17 3 0 0 0 0 16 5 4 8 17
 Argentina  24 21 25 28 74 19 0 0 0 0 43 21 25 28 74
 Armenia  6 2 6 6 14 7 0 0 0 0 13 2 6 6 14
INFO-----> SummerTimesPart : No. of times participated in summer by each country
WinterTimesPart : No. of times participated in winter by each country
In your code you set Country as the index, and in this line
dfl = df.iloc[:, [14]]
you pick just one column, which is TotalMedal.
After sorting and resetting the index, you try to change the column names with
final_df.columns = ('location', 'Total cases', 'Total Deaths')
Here is the error: you filtered your dataframe down to one column, and resetting the index brings Country back as a column. So your dataframe has only two columns, but you are trying to rename them by providing three names.
The correct line would be:
final_df.columns = ('location', 'TotalMedal')
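As a side note, renaming via a mapping sidesteps length mismatches entirely, since rename only touches the columns you name (a sketch; 'location' is the name chosen in the question):
final_df = final_df.rename(columns={'Country': 'location'})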

Pivot a selection of columns from long to wide in pandas in a particular manner

import pandas as pd
from io import StringIO
csv = '''\
a,b,name,points,marks,sets
1,2,ben,22,5,13
1,2,dave,23,4,11
'''
df = pd.read_csv(StringIO(csv))
Given the above, which looks as:
   a  b  name  points  marks  sets
0  1  2   ben      22      5    13
1  1  2  dave      23      4    11
I would like to be able to reshape it to the following:
csv= '''\
a,b,ben_points,dave_points,ben_marks,dave_marks,ben_sets,dave_sets
1,2,22,23,5,4,13,11
'''
df = pd.read_csv(StringIO(csv))
Which looks as:
   a  b  ben_points  dave_points  ben_marks  dave_marks  ben_sets  dave_sets
0  1  2          22           23          5           4        13         11
I'm not sure how to go about this though - here there is one column (name)
being spread (?) with a combination of three others.
We could unstack, then flatten the MultiIndex columns:
s=df.set_index(['a','b','name']).unstack('name')
s.columns = s.columns.map('{0[1]}_{0[0]}'.format)
s.reset_index(inplace=True)
s
   a  b  ben_points  dave_points  ben_marks  dave_marks  ben_sets  dave_sets
0  1  2          22           23          5           4        13         11
The same solution as above, via a different route:
s = df.set_index(["a", "b", "name"]).unstack("name").swaplevel(1, 0, axis=1)
#flatten the columns and join with "_"
s.columns = ["_".join(entry) for entry in s.columns.to_flat_index()]
#reset index, same as first solution
s = s.reset_index()
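On pandas 1.1+, df.pivot also accepts lists, so a pivot-based sketch (assuming each a/b/name combination is unique, as in the sample data) gives the same result:
p = df.pivot(index=['a', 'b'], columns='name', values=['points', 'marks', 'sets'])
# Flatten the (metric, name) MultiIndex into name_metric labels.
p.columns = [f'{name}_{metric}' for metric, name in p.columns]
p = p.reset_index()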

Remove unnamed columns from a pandas dataframe

I'm a student and have a problem that I can't figure out how to solve. I have CSV data like this:
"","","","","","","","","",""
"","report","","","","","","","",""
"","bla1","bla2","","","","bla3","","",""
"","bla4","bla5","","","","","bla6","",""
"","bla6","bla7","bla8","","1","2","3","4","5"
"","bla9","bla10","bla11","","6","7","8","9","10"
"","bla12","bla13","bla14","","11","12","13","14","15"
"","","","","","","","","",""
Code for reading the CSV:
SMT = pd.read_csv('file.csv', usecols=(5, 6, 7, 8), skiprows=(1, 2, 3), nrows=3)
SMT.fillna(0, inplace=True)
SMT prints out:
   Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8
0           1           2           3           4
1           6           7           8           9
2          11          12          13          14
Expected output:
1 2 3 4
6 7 8 9
11 12 13 14
I already tried skiprows=(0,1,2,3), but then it comes out like this:
    1   2   3   4
0   6   7   8   9
1  11  12  13  14
2   0   0   0   0
I already tried index=False (SMT = pd.read_csv('file.csv', index=False, usecols=(5,6,7,8), skiprows=(1,2,3), nrows=3)) and index_col=0/None/False, none of which worked, and the last thing I tried was this:
df1 = SMT.loc[:, ~SMT.columns.str.contains('^Unnamed')]
and I got:
Empty DataFrame
Columns: []
Index: [0, 1, 2]
I just want to get rid of Unnamed: 5 through Unnamed: 8. What is the correct way to get rid of this Unnamed thing?
The "unnamed" just says, that pandas does not know how to name the columns. So these are just names. You could set the names like this in the read_csv
pd.read_csv("test.csv", usecols=(5,6,7,8), skiprows=3, nrows=3, header=0, names=["c1", "c2", "c3", "c4"])
Output:
   c1  c2  c3  c4
0   1   2   3   4
1   6   7   8   9
2  11  12  13  14
You have to set header=0 so that pandas knows that row is the header; the names you pass then replace it. Alternatively, set skiprows=4.
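For the skiprows=4 variant, a sketch: with all four leading rows skipped there is no header row left in the data, so pass header=None and assign the names afterwards.
df = pd.read_csv('test.csv', usecols=(5, 6, 7, 8), skiprows=4, nrows=3, header=None)
df.columns = ['c1', 'c2', 'c3', 'c4']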
Just assign new column names:
df = pd.read_csv('temp.csv', usecols=[5,6,7,8], skiprows=[1,2,3], nrows=3)
df.columns = range(1, 1+len(df.columns))

Is there a way to select values from a dataframe by indexing with values from another dataframe [duplicate]

This question already has answers here:
Pandas lookup from one of multiple columns, based on value
(6 answers)
Closed 3 years ago.
I have 2 dataframes, both with the same number of rows but different shapes.
Essentially, I want to select the values in ref_df by using the values in df1['Data1'] as the columns input.
Below you can see my solution, but is there a way to do this without using .ix or a for loop? Also, how would I do this if my index was a datetime index instead of ['11','12','13','14']?
import pandas as pd
import numpy as np
data = {'21' : [1,2,3,4], '22' : [5,6,7,8], '23' : [9,10,11,12], '24' : [13,14,15,16]}
ref_df = pd.DataFrame(data, index=['11','12','13','14'])
df1 = pd.DataFrame({'Data': ['11','12','13','14'],'Data1': ['21','22','23','24']})
for index, row in df1.iterrows():
    df1.ix[index, 'Derived'] = ref_df.iloc[ref_df.index.get_loc(row.Data),
                                           ref_df.columns.get_loc(row.Data1)]
df1
  Data Data1
0   11    21
1   12    22
2   13    23
3   14    24
-----------
ref_df
    21  22  23  24
11   1   5   9  13
12   2   6  10  14
13   3   7  11  15
14   4   8  12  16
-----------
df1
  Data Data1  Derived
0   11    21      1.0
1   12    22      6.0
2   13    23     11.0
3   14    24     16.0
-----------
If the columns of df1 are in the same order as ref_df's index and columns, you can take the diagonal values of ref_df:
df1['Derived'] = np.diag(ref_df)
print(df1)
  Data Data1  Derived
0   11    21        1
1   12    22        6
2   13    23       11
3   14    24       16
If they are not aligned, reorder ref_df according to df1 first.
Or use lookup directly:
df1['Derived'] = ref_df.lookup(df1['Data'], df1['Data1'])
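Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, an equivalent sketch uses NumPy fancy indexing via the index/column positions:
rows = ref_df.index.get_indexer(df1['Data'])
cols = ref_df.columns.get_indexer(df1['Data1'])
df1['Derived'] = ref_df.to_numpy()[rows, cols]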
