Remove unnamed columns from a pandas DataFrame - Python

I'm a student and I have a problem that I can't figure out how to solve. I have CSV data like this:
"","","","","","","","","",""
"","report","","","","","","","",""
"","bla1","bla2","","","","bla3","","",""
"","bla4","bla5","","","","","bla6","",""
"","bla6","bla7","bla8","","1","2","3","4","5"
"","bla9","bla10","bla11","","6","7","8","9","10"
"","bla12","bla13","bla14","","11","12","13","14","15"
"","","","","","","","","",""
The code for reading the CSV looks like this:
SMT = pd.read_csv("file.csv", usecols=(5,6,7,8), skiprows=(1,2,3), nrows=3)
SMT.fillna(0, inplace=True)
SMT prints out:
Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8
0 1 2 3 4
1 6 7 8 9
2 11 12 13 14
expected output :
1 2 3 4
6 7 8 9
11 12 13 14
I already tried skiprows=(0,1,2,3), but then it comes out like this:
1 2 3 4
0 6 7 8 9
1 11 12 13 14
2 0 0 0 0
I already tried putting index=False, i.e. SMT = pd.read_csv("file.csv", index=False, usecols=(5,6,7,8), skiprows=(1,2,3), nrows=3), as well as index_col=0/None/False, and none of them work. The last thing I tried was this:
df1 = SMT.loc[:, ~SMT.columns.str.contains('^Unnamed')]
and I got:
Empty DataFrame
Columns: []
Index: [0, 1, 2]
I just want to get rid of Unnamed: 5 through Unnamed: 8. What is the correct way to get rid of these Unnamed columns?

The "unnamed" just says, that pandas does not know how to name the columns. So these are just names. You could set the names like this in the read_csv
pd.read_csv("test.csv", usecols=(5,6,7,8), skiprows=3, nrows=3, header=0, names=["c1", "c2", "c3", "c4"])
Output:
c1 c2 c3 c4
0 1 2 3 4
1 6 7 8 9
2 11 12 13 14
You have to set header=0 so that pandas knows that this row is the header (which the given names then replace). Alternatively, you can set skiprows=4.
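For example, a rough sketch of the skiprows=4 variant (assuming the file.csv layout shown in the question; the names=... pattern is the same as above):
import pandas as pd

# Skip all four leading rows so no header row remains, then supply the column names directly.
SMT = pd.read_csv("file.csv", usecols=[5, 6, 7, 8], skiprows=4, nrows=3,
                  header=None, names=["c1", "c2", "c3", "c4"])
print(SMT)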

Just assign new column names:
df = pd.read_csv('temp.csv', usecols=[5,6,7,8], skiprows=[1,2,3], nrows=3)
df.columns = range(1, 1+len(df.columns))
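For reference, printing df with the sample data from the question should then give roughly:
    1   2   3   4
0   1   2   3   4
1   6   7   8   9
2  11  12  13  14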


Rewriting a column's cell values in a DataFrame based on when the value changes, without using an if statement

I have a column with faulty values: it is supposed to count cycles, but the device the data comes from resets the count after 50, so I was left with, for example, [1,1,1,1,2,2,2,3,3,3,3,...,50,50,50,1,1,1,2,2,2,2,3,3,3,...,50,50,.....,50]
My attempted solution is below, and I can't even make it work (for simplicity I made the data reset after 10 cycles):
data = {'Cyc-Count': [1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,1,1,1,2,3,3,3,3,
                      4,4,5,6,6,6,7,8,8,8,8,9,10]}
df = pd.DataFrame(data)
x = 0
count = 0
old_value = df.at[x, 'Cyc-Count']
for x in range(x, len(df) - 1):
    if df.at[x, 'Cyc-Count'] == df.at[x + 1, 'Cyc-Count']:
        old_value = df.at[x + 1, 'Cyc-Count']
        df.at[x + 1, 'Cyc-Count'] = count
    else:
        old_value = df.at[x + 1, 'Cyc-Count']
        count += 1
        df.at[x + 1, 'Cyc-Count'] = count
I need to fix this, but preferably without even using if statements.
The desired output for the example above should be:
data = {'Cyc-Count':[1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,11,11,11,12,13,13,13,13,
14,14,15,16,16,16,17,18,18,18,18,19,20]}
hint" my method has a big issue is that the last indexed value will be hard to change since when comparing it with its index+1 > it dosnt even exist
IIUC, you want to continue the count when the counter decreases.
You can use vectorized code: wherever the previous value is greater than the current one (i.e. at a reset), take that previous value as an offset and accumulate the offsets with cumsum:
s = df['Cyc-Count'].shift()
df['Cyc-Count2'] = (df['Cyc-Count']
                    + s.where(s.gt(df['Cyc-Count']))
                       .fillna(0, downcast='infer')
                       .cumsum()
                    )
Or, to modify the column in place:
s = df['Cyc-Count'].shift()
df['Cyc-Count'] += (s.where(s.gt(df['Cyc-Count']))
                     .fillna(0, downcast='infer')
                     .cumsum()
                    )
Output:
Cyc-Count Cyc-Count2
0 1 1
1 1 1
2 1 1
3 1 1
4 2 2
5 2 2
6 2 2
7 3 3
8 3 3
9 3 3
10 3 3
11 4 4
12 5 5
13 5 5
14 5 5
15 1 6
16 1 6
17 1 6
18 2 7
19 2 7
20 2 7
21 2 7
22 3 8
23 3 8
24 3 8
25 4 9
26 5 10
27 5 10
28 1 11
29 2 12
30 2 12
31 3 13
32 4 14
33 5 15
34 5 15
Used input:
l = [1,1,1,1,2,2,2,3,3,3,3,4,5,5,5,1,1,1,2,2,2,2,3,3,3,4,5,5,1,2,2,3,4,5,5]
df = pd.DataFrame({'Cyc-Count': l})
You can use df.loc to access a group of rows and columns by label(s) or a boolean array.
syntax: df.loc[df['column name'] condition, 'column name or the new one'] = 'value if condition is met'
for example:
import pandas as pd
numbers = {'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]}
df = pd.DataFrame(numbers,columns=['set_of_numbers'])
print (df)
df.loc[df['set_of_numbers'] == 0, 'set_of_numbers'] = 999
df.loc[df['set_of_numbers'] == 5, 'set_of_numbers'] = 555
print (df)
Before: ‘set_of_numbers’: [1,2,3,4,5,6,7,8,9,10,0,0]
After: ‘set_of_numbers’: [1,2,3,4,555,6,7,8,9,10,999,999]

How to reorder columns in a Pandas dataframe based on other dataframe columns

Suppose these dataframes:
import pandas as pd
df_one = pd.DataFrame({'col_1':[1, 2, 3, 4], 'col_2':[5,6,7,8], 'col_3':[9,10,11,12]})
df_two = pd.DataFrame({'col_1':[1, 2, 3, 4], 'col_3': [9,10,11,12], '2_col':[5, 6, 7, 8]})
In reality these dataframes come from different txt files, so the concept of each column is the same but the order of the columns is not, and some of the columns have slightly different names. Both datasets have 33 columns representing the same concepts, but in a different order.
How can I reorder the second df with the same structure as the first df? Meaning same order of columns and same column names as df_one...
The final objective is to merge both df into a single consolidated one.
I have tried this:
cols = df_one.columns.to_list()  # get column names from df_one
df_two = df_two.reindex(columns=cols)
but this gets NaN values in 'col_2':
col_1 col_2 col_3
0 1 NaN 9
1 2 NaN 10
2 3 NaN 11
3 4 NaN 12
I also tried to first change col names in df_two and then reorder:
df_two.columns = cols
df_two = df_two.reindex(columns=cols)
but this is also wrong (col_2 now has the values of col_3):
col_1 col_2 col_3
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
Thanks for your suggestions.
EDIT BASED ON COMMENTS:
Actual column names are more like: 'Date' & 'iDate', 'Contract' & 'nContract', 'Premium' & 'iPremium'. I exemplified with numbers in the question (probably a bad idea), but correlated numbers are not part of the names.
How can I map the order of columns in df_two ? (say, col 1 of df_1 is the same as col 1 in df_2, col 2 of df_1 is col_3 of df_2, col_3 of df_1 is col_2 of df_2) - And then I would rename the columns in df_2 as in df_1.
We can do:
df[['col_2', 'col_3']] = -np.sort(-df[['col_2', 'col_3']].values, axis=1)
df
col_1 col_2 col_3
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
I assumed that all column names contain at least one number, so you can order df_two based on that number and then rename the columns. You can try something like this:
import pandas as pd
import re
df_one = pd.DataFrame({'col_1':[1, 2, 3, 4], 'col_2':[5,6,7,8], 'col_3':[9,10,11,12]})
df_two = pd.DataFrame({'col_1':[1, 2, 3, 4], 'col_3': [9,10,11,12], '2_col':[5, 6, 7, 8]})
print('df_two old:\n\n',df_two,'\n')
def findnum(col):
    return int(re.findall(r'\d+', col)[0])

df_two = df_two[sorted(df_two.columns, key=findnum)]
df_two.columns=df_one.columns
print('df_two new: \n')
print(df_two)
Output:
df_two old:
col_1 col_3 2_col
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
df_two new:
col_1 col_2 col_3
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
If your common parameter is like 'Contract' & 'ContractNum' as you said, you can try something like this:
import pandas as pd
df_one = pd.DataFrame({'Contract':[1, 2, 3, 4], 'Date':[5,6,7,8], 'Provider':[9,10,11,12]})
df_two = pd.DataFrame({'iDate':[1, 2, 3, 4], 'ContractNum': [9,10,11,12], 'nProvider':[5, 6, 7, 8]})
print('df_one:\n', df_one,'\n')
print('df_two:\n', df_two,'\n')
def func(pal):
    for i, val in enumerate(df_one.columns):
        if val.lower() in pal.lower():
            return int(i)

df_two = df_two[sorted(df_two.columns, key=func)]
print('df_two sorted: ')
print(df_two,'\n')
df_two.columns=df_one.columns
print('df_two new colnames: ')
print(df_two,'\n')
Output:
df_one:
Contract Date Provider
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
df_two:
iDate ContractNum nProvider
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
df_two sorted:
ContractNum iDate nProvider
0 9 1 5
1 10 2 6
2 11 3 7
3 12 4 8
df_two new colnames:
Contract Date Provider
0 9 1 5
1 10 2 6
2 11 3 7
3 12 4 8
If the numbers are the common parameter between the columns, we can extract them and pass them into the .map function then reassign them using a custom dictionary.
df_two.columns = df_two.columns.str.extract(r"(\d+)")[0].map(
    {col.split("_")[1]: col for col in df_one.columns}
).tolist()
# {'1': 'col_1', '2': 'col_2', '3': 'col_3'}  <- dict
# ['col_1', 'col_3', 'col_2']  <- map output that we re-assign.
print(df_two)
col_1 col_3 col_2
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
Then you can merge/concatenate them: pd.concat([df_one, df_two]).
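As a rough sketch of that last step (assuming df_two has already been given df_one's column names as above), pd.concat aligns on column names, so a differing column order is not a problem; you can also force df_one's column order on the result:
import pandas as pd

# Assumes df_one and df_two already share the same column names (order may differ).
merged = pd.concat([df_one, df_two], ignore_index=True)

# Optional: put the merged frame into df_one's column order.
merged = merged[df_one.columns]
print(merged)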

Filling in a new data frame based on two other data frames

I want an efficient way to solve the problem below, because my code seems inefficient.
First of all, let me provide a dummy dataset.
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
df1= {'a0' : [1,2,2,1,3], 'a1' : [2,3,3,2,4], 'a2' : [3,4,4,3,5], 'a3' : [4,5,5,4,6], 'a4' : [5,6,6,5,7]}
df2 = {'b0' : [3,6,6,3,8], 'b1' : [6,8,8,6,9], 'b2' : [8,9,9,8,7], 'b3' : [9,7,7,9,2], 'b4' : [7,2,2,7,1]}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
My actual dataset has more than 100,000 rows and 15 columns. Now, what I want to do is pretty complicated to explain, but here we go.
Goal: I want to create a new df using the two dfs above.
Find the global min and max from df1. Since the values are sorted within each row, column 'a0' will always hold the row minimum and column 'a4' the row maximum. Therefore, I will find the minimum of column 'a0' and the maximum of 'a4'.
Min = df1['a0'].min()
Max = df1['a4'].max()
Min
Max
Then I will create a data frame filled with 0s, whose columns cover range(Min, Max+1), in this case 1 through 7.
column = []
for i in np.arange(Min, Max + 1):
    column.append(i)
newdf = pd.DataFrame(0, index=df1.index, columns=column)
The third step is to find the place where the values from df2 will go:
I want to loop through each value in df1 and match each value with the column name in the new df in the same row.
For example, if we are looking at row 0 and go through each column; the values in this case [1,2,3,4,5]. Then the row 0 of the newdf, column 1,2,3,4,5 will be filled with the corresponding values from df2.
Lastly, each corresponding value from df2 (same position) will be placed into the cell found in the previous step.
So, the very first row of the new df will look like this:
output = {'1' : [3], '2' : [6], '3' : [8], '4' : [9], '5' : [7], '6' : [0], '7' : [0]}
output = pd.DataFrame(output)
Columns 6 and 7 will not be updated because 6 and 7 do not appear in the very first row of df1.
Here is my code for this process:
for rowidx in range(0, len(df1)):
    for columnidx in range(0, len(df1.columns)):
        new_column = df1[str(df1.columns[columnidx])][rowidx]
        newdf.loc[newdf.index[rowidx], new_column] = df2['b' + df1.columns[columnidx][1:]][rowidx]
I think this does the job, but as I said, my actual dataset is huge, with 2,999,999 rows, and the Min-to-Max range is 282, which means 282 columns in the new data frame.
So, the code above runs forever. Is there a faster way to do this? I think I learned something like map-reduce, but I don't know if that would apply here.
The idea is to create default column names in both DataFrames, then concatenate the stacked Series, append the first column (0) to the index, remove the second index level, and finally use DataFrame.unstack:
df1.columns = range(len(df1.columns))
df2.columns = range(len(df2.columns))
newdf = (pd.concat([df1.stack(), df2.stack()], axis=1)
           .set_index(0, append=True)
           .reset_index(level=1, drop=True)[1]
           .unstack(fill_value=0)
           .rename_axis(None, axis=1))
print (newdf)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
Other solutions:
comp =[pd.Series(a, index=df1.loc[i]) for i, a in enumerate(df2.values)]
df = pd.concat(comp, axis=1).T.fillna(0).astype(int)
print (df)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
Or:
comp = [dict(zip(x, y)) for x, y in zip(df1.values, df2.values)]
c = pd.DataFrame(comp).fillna(0).astype(int)
print (c)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1

Load columns x1,x2,x3.. with headers: 0,1,

I'm new to Python, and I'm having an extremely frustrating problem. I need to load columns 1-12 of a CSV file (so not the 0th column), but I need to skip the file's header row and overwrite it with "0,1,..,11".
I need to use panda.read_csv() for this.
basically, my csv is:
"a", "b", "c", ..., "l"
1, 2, 3, ..., 12
1, 2, 3, ..., 12
and I want to load it as a dataframe such that
dataframe[0] = 2,2,2,..
dataframe[1] = 3,3,3..
ergo skipping the first column, and making the dataframe's column labels start at 0.
I've tried setting usecols=[1,2,3..], but then the column labels are 1,2,3,...
Any help would be appreciated.
You can use header=<int> to skip the header lines, usecols=range(1,12) to grab the last 11 columns, and names=range(11) to name those 11 columns 0 through 10.
Here is a fake dataset:
This is the header. Header header header.
And the second header line.
a,b,c,d,e,f,g,h,i,j,k,l
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
Using the code:
> df = pd.read_csv('data_file.csv', usecols=range(1,12), names=range(11), header=2)
> df
# returns:
0 1 2 3 4 5 6 7 8 9 10
0 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 5 6 7 8 9 10 11 12
2 2 3 4 5 6 7 8 9 10 11 12
> df[0]
# returns:
0 2
1 2
2 2

Merge two CSV's with unique columns in python

I have two CSV files representing data from two different years. I know how to do the basic merging using csv.writer and dict keys, but the problem lies here: while the CSVs have mostly shared column headers, each may have unique columns. If a species was caught in one year but not the other, that column would only be present in that year. How can I merge the new data with the old data, creating new columns and padding the old data with zeros in those columns?
File 1: "Date","Time","Species A","Species B", "Species X"
File 2: "Date","Time", "Species A", "Species B", "Species C"
I need the end result to be one csv with this header:
"Date","Time","Species A","Species B", "Species C", "Species X"
Someone else will probably post a solution using the csv module, so I'll give a pandas solution for comparison purposes:
import pandas as pd
df1 = pd.read_csv("fish1.csv")
df2 = pd.read_csv("fish2.csv")
df = pd.concat([df1, df2]).fillna(0)
df = df[["Date", "Time"] + list(df.columns[1:-1])]
df.to_csv("merged_fish.csv", index=False)
Explanation:
First, we read in the two files:
>>> df1 = pd.read_csv("fish1.csv")
>>> df2 = pd.read_csv("fish2.csv")
>>> df1
Date Time Species A Species B Species X
0 1 2 3 4 5
1 6 7 8 9 10
2 11 12 13 14 15
>>> df2
Date Time Species A Species B Species C
0 16 17 18 19 20
1 21 22 23 24 25
2 26 27 28 29 30
Then we simply concatenate them, which automatically fills the missing data with NaN:
>>> df = pd.concat([df1, df2])
>>> df
Date Species A Species B Species C Species X Time
0 1 3 4 NaN 5 2
1 6 8 9 NaN 10 7
2 11 13 14 NaN 15 12
0 16 18 19 20 NaN 17
1 21 23 24 25 NaN 22
2 26 28 29 30 NaN 27
You want them filled with 0 instead, so:
>>> df = pd.concat([df1, df2]).fillna(0)
>>> df
Date Species A Species B Species C Species X Time
0 1 3 4 0 5 2
1 6 8 9 0 10 7
2 11 13 14 0 15 12
0 16 18 19 20 0 17
1 21 23 24 25 0 22
2 26 28 29 30 0 27
This order isn't quite the one you asked for, though; you wanted Date and Time first, so:
>>> df = df[["Date", "Time"] + list(df.columns[1:-1])]
>>> df
Date Time Species A Species B Species C Species X
0 1 2 3 4 0 5
1 6 7 8 9 0 10
2 11 12 13 14 0 15
0 16 17 18 19 20 0
1 21 22 23 24 25 0
2 26 27 28 29 30 0
And then we save it as a CSV file:
>>> df.to_csv("merged_fish.csv", index=False)
producing
Date,Time,Species A,Species B,Species C,Species X
1,2,3,4,0.0,5.0
6,7,8,9,0.0,10.0
11,12,13,14,0.0,15.0
16,17,18,19,20.0,0.0
21,22,23,24,25.0,0.0
26,27,28,29,30.0,0.0
Here's a csv module solution in Python 3:
import csv
# Generate some data...
csv1 = '''\
Date,Time,Species A,Species B,Species C
04/01/2012,13:00,1,2,3
04/02/2012,13:00,1,2,3
04/03/2012,13:00,1,2,3
04/04/2012,13:00,1,2,3
'''
csv2 = '''\
Date,Time,Species A,Species B,Species X
04/01/2013,13:00,1,2,3
04/02/2013,13:00,1,2,3
04/03/2013,13:00,1,2,3
04/04/2013,13:00,1,2,3
'''
with open('2012.csv', 'w') as f:
    f.write(csv1)
with open('2013.csv', 'w') as f:
    f.write(csv2)
# The actual program
years = ['2012.csv','2013.csv']
lines = []
headers = set()
for year in years:
    with open(year, 'r', newline='') as f:
        r = csv.DictReader(f)
        lines.extend(list(r))                  # Merge lines from all files.
        headers = headers.union(r.fieldnames)  # Collect unique column names.
# Sort the unique headers keeping Date,Time columns first.
new_headers = ['Date','Time'] + sorted(headers - set(['Date','Time']))
with open('result.csv', 'w', newline='') as f:
    # The 3rd parameter is the default if the key isn't present.
    w = csv.DictWriter(f, new_headers, 0)
    w.writeheader()
    w.writerows(lines)
# View the result
with open('result.csv') as f:
    print(f.read())
Output:
Date,Time,Species A,Species B,Species C,Species X
04/01/2012,13:00,1,2,3,0
04/02/2012,13:00,1,2,3,0
04/03/2012,13:00,1,2,3,0
04/04/2012,13:00,1,2,3,0
04/01/2013,13:00,1,2,0,3
04/02/2013,13:00,1,2,0,3
04/03/2013,13:00,1,2,0,3
04/04/2013,13:00,1,2,0,3
According to the docs, it looks like you should be able to read in both files, merge the keys from the two extracted dictionaries, and then use the fieldnames and restval parameters on the writer to achieve your 0 defaults.
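A rough sketch of that approach, reusing the 2012.csv/2013.csv files generated in the previous answer (the output file name merged.csv is just a placeholder):
import csv

rows = []
fieldnames = []
for path in ('2012.csv', '2013.csv'):
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        rows.extend(reader)                 # collect every row as a dict
        for name in reader.fieldnames:      # merge headers, keeping first-seen order
            if name not in fieldnames:
                fieldnames.append(name)

with open('merged.csv', 'w', newline='') as f:
    # restval=0 fills in columns that are missing from a given row's source file.
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval=0)
    writer.writeheader()
    writer.writerows(rows)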
