I'm reading a number of CSV files into Python using glob matching and would like to add the filename as a column in each of the dataframes. I'm currently matching on a pattern and then using a generator to read in the files, like so:
base_list_of_files = glob.glob(matching_pattern)
loaded_csv_data_frames = (pd.read_csv(csv, encoding='latin-1') for csv in base_list_of_files)
for idx, df in enumerate(loaded_csv_data_frames):
    df['file_origin'] = base_list_of_files[idx]
combined_data = pd.concat(loaded_csv_data_frames)
However, I get the error ValueError: No objects to concatenate when I come to do the concatenation. Why does adding the column iteratively break the list of dataframes?
Generators can only be iterated over once; at the end they raise a StopIteration exception, which is automatically handled by the for loop. If you try to consume them again they will just raise StopIteration immediately, as demonstrated here:
def consume(gen):
    while True:
        try:
            print(next(gen))
        except StopIteration:
            print("Stop iteration")
            break
>>> gen = (i for i in range(2))
>>> consume(gen)
0
1
Stop iteration
>>> consume(gen)
Stop iteration
Your for loop exhausts the generator, so by the time pd.concat runs there is nothing left to consume. That's why you get the ValueError when you try to use loaded_csv_data_frames a second time.
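The simplest fix, sketched here against the question's own code (matching_pattern is assumed from the question), is to materialize the generator into a list so the frames survive the first loop:

import glob
import pandas as pd

# a list comprehension reads everything eagerly, so the frames
# can be modified in the loop and still be there for concat
base_list_of_files = glob.glob(matching_pattern)
loaded_csv_data_frames = [pd.read_csv(csv, encoding='latin-1') for csv in base_list_of_files]
for idx, df in enumerate(loaded_csv_data_frames):
    df['file_origin'] = base_list_of_files[idx]
combined_data = pd.concat(loaded_csv_data_frames)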
I cannot replicate your example, but here is something that should be similar enough:
df1 = pd.DataFrame(0, columns=["a", "b"], index=[0, 1])
df2 = pd.DataFrame(1, columns=["a", "b"], index=[0, 1])
loaded_csv_data_frames = iter((df1, df2)) # Pretend that these are read from a csv file
base_list_of_files = iter(("df1.csv", "df2.csv")) # Pretend these file names come from glob
You can add the file of origin as a key when you concatenate. Add names too to give titles to your index levels.
>>> df = pd.concat(loaded_csv_data_frames, keys=base_list_of_files, names=["file_origin", "index"])
>>> df
                   a  b
file_origin index
df1.csv     0      0  0
            1      0  0
df2.csv     0      1  1
            1      1  1
If you want file_origin to be one of your columns, just reset the first level of the index.
>>> df.reset_index("file_origin")
      file_origin  a  b
index
0         df1.csv  0  0
1         df1.csv  0  0
0         df2.csv  1  1
1         df2.csv  1  1
This question has been asked multiple times in this community, but I couldn't find the correct answer since I am a beginner in Python. I actually have 2 questions:
I want to concatenate 3 columns (A, B, C) with their values into 1 column. The header would be ABC.
import os
import pandas as pd

directory = 'C:/Path'
ext = ('.csv')
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if f.endswith(ext):
        head_tail = os.path.split(f)
        head_tail1 = 'C:/Output'
        k = head_tail[1]
        r = k.split(".")[0]
        p = head_tail1 + "/" + r + " - Revised.csv"
        mydata = pd.read_csv(f)
        new = mydata[["A","B","C","D"]]
        new = new.rename(columns={'D': 'Total'})
        new['Total'] = 1
        new.to_csv(p, index=False)
Once concatenated, is it possible to count each uniqueid and put the total in Column D? Basically, I want the total count per uniqueid (Column ABC). The data can be found via a link: when you click that uniqueid (for example uniqueid1), you go to the next page, which shows the total for that uniqueid.
On the linked page, you can get the total number of uniqueids by Serial ID.
I have no idea how to do this, but I would really appreciate it if someone could help me with this project; I would learn a lot from it.
Thank you very much. God Bless
I searched Google, YouTube and Stack Overflow, but couldn't find the correct answer.
I'm not sure that I understand your question correctly. However, if you know exactly which column names (e.g., A, B, and C) you want to concatenate, you can do something like the code below.
''.join(merge_columns) is to concatenate column names.
new[merge_columns].apply(lambda x: ''.join(x), axis=1) is to concatenate their values.
Then, you can count unique values of the new column using groupby().count().
new = mydata[["A","B","C","D"]]
new = new.rename(columns={'D': 'Total'})
new['Total'] = 1
# added lines
merge_columns = ['A', 'B', 'C']
merged_col = ''.join(merge_columns)
new[merged_col] = new[merge_columns].apply(lambda x: ''.join(x), axis=1)
new.drop(merge_columns, axis=1, inplace=True)
new = new.groupby(merged_col).count().reset_index()
new.to_csv(p, index=False)
Example:

# before
> new
   A  B  C  Total
0  a  b  c      1
1  x  y  z      1
2  a  b  c      1

# after executing the added lines
> new
   ABC  Total
0  abc      2
1  xyz      1
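One hedged caveat on the join: ''.join assumes the A, B and C cells already hold strings; if they can be numeric, cast first:

# cast to str so the row-wise join cannot fail on numeric cells
new[merged_col] = new[merge_columns].astype(str).apply(''.join, axis=1)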
Next time, try to specify your issues and give a minimal reproducible example.
This is just an example of how to use pd.melt and pd.groupby.
I hope it helps with your question.
import pandas as pd
### example dataframe
df = pd.DataFrame([['first', 1, 2, 3], ['second', 4, 5, 6], ['third', 7, 8, 9]], columns=['ID', 'A', 'B', 'C'])
### directly sum up A, B and C
df['total'] = df.sum(axis=1, numeric_only=True)
print(df)
### how to create a so called long dataframe with melt
df_long = pd.melt(df, id_vars='ID', value_vars=['A', 'B', 'C'], var_name='ABC')
print(df_long)
### group long dataframe by column and sum up all values with this ID
df_group = df_long.groupby(by='ID').sum()
print(df_group)
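A hedged caveat on the last step: on newer pandas versions, summing the whole grouped frame will also try to aggregate the non-numeric ABC column, so selecting the value column first is the safer variant:

### select the numeric column before aggregating, so the string
### column 'ABC' stays out of the sum
df_group = df_long.groupby(by='ID')['value'].sum()
print(df_group)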
I have three txt files with data, 4 columns of numbers. I need to load them into one data frame (dimension [3, n], where n is the length of a column). Because I need only one column from each file, I decided to use the Series.from_csv() function, but I cannot comprehend the output.
I have written this code:
names = glob.glob("*.txt")
for i in names:
    rank = pd.Series.from_csv(i, sep=" ", index_col=3)
    print rank
And this prints one column of my data (that's good), but also one column filled entirely with zeros, like this:
0.039157    0
0.039001    0
0.038524    0
0.038579    0
0.038385    0
What I find more bizarre is that when I use
rank = pd.Series.from_csv(i, sep=" ", index_col=3).values
I got this:
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
So does that mean these zeros were the values read from the files? Then what is the first column from before? I have tried many methods, but I have failed to understand this.
I think you can use the more common read_csv with delim_whitespace=True and usecols for filtering the column; first append all DataFrames to the list dfs and then use concat:
dfs = []
names = glob.glob("*.txt")
for i in names:
    rank = pd.read_csv(i, delim_whitespace=True, usecols=[3])
    print rank
    dfs.append(rank)

df = pd.concat(dfs, axis=1)
Or with sep='\s+', where the separator is arbitrary whitespace:
dfs = []
names = glob.glob("*.txt")
for i in names:
    rank = pd.read_csv(i, sep='\s+', usecols=[3])
    print rank
    dfs.append(rank)

df = pd.concat(dfs, axis=1)
You can also use a list comprehension:
files = glob.glob("*.txt")
dfs = [pd.read_csv(fp, delim_whitespace=True, usecols=[3]) for fp in files]
df = pd.concat(dfs, axis=1)
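A hedged side note: Series.from_csv was deprecated and later removed from pandas, so on modern versions the read_csv route is the only one; to get a Series rather than a one-column DataFrame, you can squeeze the result:

files = glob.glob("*.txt")
# read only the fourth column, then collapse the one-column frame into a Series
series_list = [pd.read_csv(fp, delim_whitespace=True, usecols=[3]).squeeze("columns") for fp in files]
df = pd.concat(series_list, axis=1)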
I need to give names to previously defined dataframes.
I have a list of dataframes :
liste_verif = (dffreesurfer, total, qcschizo)
And I would like to give them a name by doing something like:
for h in liste_verif:
    h.name = str(h)
Would that be possible?
When I test this code, it doesn't work: instead of considering h as a dataframe, Python considers each column of my dataframe.
I would like the name of my dataframe to be 'dffreesurfer', 'total' etc...
You can use a dict comprehension and map the DataFrames to the names in list L:
dffreesurfer = pd.DataFrame({'col1': [7,8]})
total = pd.DataFrame({'col2': [1,5]})
qcschizo = pd.DataFrame({'col2': [8,9]})
liste_verif = (dffreesurfer, total, qcschizo)
L = ['dffreesurfer', 'total', 'qcschizo']
dfs = {L[i]: x for i, x in enumerate(liste_verif)}
print (dfs['dffreesurfer'])
   col1
0     7
1     8
print (dfs['total'])
   col2
0     1
1     5
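Equivalently, as a small stylistic variant, zip pairs the names and the frames directly:

# same mapping as the dict comprehension above, built with zip
dfs = dict(zip(L, liste_verif))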
I have a matching algorithm which links students to projects. It's working, but I'm having trouble exporting the data to a csv file. It only exports the last value, when there are 200 values to be exported.
The exported data also uses each number as a separate value, when I would like to get the whole 's' rather than the three numbers which make up 's' split across three columns. I've attached the images below. Any help would be appreciated.
[Image: what it looks like]
[Image: what it should look like]
# Imports for Pandas
import pandas as pd
from pandas import DataFrame

SPA()
for m in M:
    s = m['student']
    l = m['lecturer']
    Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
    id = m['projectid']
    p = Project[id]['title']
    c = Project[id]['sourceid']
    r = str(getRank("Single_Projects1copy.csv", s, c))
    print(s+","+l+","+p+","+c+","+r)
    dataPack = (s+","+l+","+p+","+c+","+r)
    df = pd.DataFrame.from_records([dataPack])
    df.to_csv('try.csv')
You keep overwriting df in the loop, so you only end up with the last bit of data. You need to either append to the csv with df.to_csv('try.csv', mode='a', header=False), or build one df up in the loop and write it once outside the loop, something like:
df = pd.DataFrame()
for m in M:
    s = m['student']
    l = m['lecturer']
    Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
    id = m['projectid']
    p = Project[id]['title']
    c = Project[id]['sourceid']
    r = str(getRank("Single_Projects1copy.csv", s, c))
    print(s+","+l+","+p+","+c+","+r)
    dataPack = (s+","+l+","+p+","+c+","+r)
    df = df.append(pd.DataFrame.from_records([dataPack]))  # append returns a new frame, so reassign
df.to_csv('try.csv')  # write all data once outside the loop
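A hedged note for newer pandas: DataFrame.append was removed in pandas 2.0, so on modern versions the equivalent pattern is to collect the row frames in a list and concatenate once (M, Lecturer, Project and getRank as in the question):

rows = []
for m in M:
    s = m['student']
    l = m['lecturer']
    Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
    id = m['projectid']
    p = Project[id]['title']
    c = Project[id]['sourceid']
    r = str(getRank("Single_Projects1copy.csv", s, c))
    rows.append(pd.DataFrame.from_records([(s, l, p, c, r)]))  # tuple keeps the fields as separate columns
df = pd.concat(rows, ignore_index=True)
df.to_csv('try.csv')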
A better option would be to open a file and pass that file object to to_csv:
with open('try.csv', 'w') as f:
    for m in M:
        s = m['student']
        l = m['lecturer']
        Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
        id = m['projectid']
        p = Project[id]['title']
        c = Project[id]['sourceid']
        r = str(getRank("Single_Projects1copy.csv", s, c))
        print(s+","+l+","+p+","+c+","+r)
        dataPack = (s+","+l+","+p+","+c+","+r)
        pd.DataFrame.from_records([dataPack]).to_csv(f, header=False)
You get individual chars because you are passing from_records a single string, dataPack, as the value, so it iterates over the chars:
In [18]: df = pd.DataFrame.from_records(["foobar,"+"bar"])

In [19]: df
Out[19]:
   0  1  2  3  4  5  6  7  8  9
0  f  o  o  b  a  r  ,  b  a  r

In [20]: df = pd.DataFrame(["foobar,"+"bar"])

In [21]: df
Out[21]:
            0
0  foobar,bar
I think you basically want to leave dataPack as a tuple, dataPack = (s, l, p, c, r), and use pd.DataFrame([dataPack]) so each field becomes its own column. You don't really need pandas at all here; the csv module would do all this for you without needing to create DataFrames.
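And a hedged sketch of the csv-module route just mentioned, again assuming M, Lecturer, Project and getRank from the question:

import csv

with open('try.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for m in M:
        s = m['student']
        l = m['lecturer']
        id = m['projectid']
        p = Project[id]['title']
        c = Project[id]['sourceid']
        r = str(getRank("Single_Projects1copy.csv", s, c))
        writer.writerow([s, l, p, c, r])  # one row per match, no DataFrame needed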
I am currently using this code:
import pandas as pd
import numpy as np

AllDays = ['a','b','c','d']
TempDay = pd.DataFrame(np.random.randn(4, 2))
TempDay['Dates'] = AllDays
TempDay.to_csv(r'H:\MyFile.csv', index=False, header=False)
But when it prints, it prints the array before the dates, with a header row. I am seeking to print the dates before the TemperatureArray and no header rows.
Edit:
The file currently contains the TemperatureArray followed by Dates: [TemperatureArray, Date].
-0.27724356949570034,-0.3096554106726788,a
-0.10619546908708237,0.07430127684522048,b
-0.07619665345406437,0.8474460146082116,c
0.19668718143436803,-0.8072994364484335,d
I am looking to print: [Date, TemperatureArray]
a,-0.27724356949570034,-0.3096554106726788
b,-0.10619546908708237,0.07430127684522048
c,-0.07619665345406437,0.8474460146082116
d,0.19668718143436803,-0.8072994364484335
The pandas.DataFrame.to_csv method has a keyword argument, header=True, that can be set to False to disable the header row.
However, it sometimes does not work (from experience).
Using it in conjunction with index=False should solve your issue.
For example, this snippet should fix your issue:
TempDay.to_csv(r'C:\MyFile.csv', index=False, header=False)
Here is a full example showing how it disables the header row:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(6,4))
>>> df
          0         1         2         3
0  1.295908  1.127376 -0.211655  0.406262
1  0.152243  0.175974 -0.777358 -1.369432
2  1.727280 -0.556463 -0.220311  0.474878
3 -1.163965  1.131644 -1.084495  0.334077
4  0.769649  0.589308  0.900430 -1.378006
5 -2.663476  1.010663 -0.839597 -1.195599
>>> # just assigns sequential letters to a new column
>>> df[4] = [chr(i+ord('A')) for i in range(6)]
>>> df
          0         1         2         3  4
0  1.295908  1.127376 -0.211655  0.406262  A
1  0.152243  0.175974 -0.777358 -1.369432  B
2  1.727280 -0.556463 -0.220311  0.474878  C
3 -1.163965  1.131644 -1.084495  0.334077  D
4  0.769649  0.589308  0.900430 -1.378006  E
5 -2.663476  1.010663 -0.839597 -1.195599  F
>>> # here we reindex the headers and return a copy
>>> # using this form of indexing just requires you to provide
>>> # a list with all the columns you desire and in the order desired
>>> df2 = df[[4, 1, 2, 3]]
>>> df2
   4         1         2         3
0  A  1.127376 -0.211655  0.406262
1  B  0.175974 -0.777358 -1.369432
2  C -0.556463 -0.220311  0.474878
3  D  1.131644 -1.084495  0.334077
4  E  0.589308  0.900430 -1.378006
5  F  1.010663 -0.839597 -1.195599
>>> df2.to_csv('a.txt', index=False, header=False)
>>> with open('a.txt') as f:
... print(f.read())
...
A,1.1273756275298716,-0.21165535441591588,0.4062624848191157
B,0.17597366083826546,-0.7773584823122313,-1.3694320591723093
C,-0.556463084618883,-0.22031139982996412,0.4748783498361957
D,1.131643603259825,-1.084494967896866,0.334077296863368
E,0.5893080536600523,0.9004299653290818,-1.3780062860066293
F,1.0106633581546611,-0.839597332636998,-1.1955992812601897
If you need to adjust the columns dynamically and move the last column to the front, you can do as follows:
# this returns the columns as a list
columns = df.columns.tolist()
# removes the last column, the newest one you added
tofirst_column = columns.pop(-1)
# just move it to the start
new_columns = [tofirst_column] + columns
# then reindex the frame with the new column order
df2 = df[new_columns]
This simply takes the current columns as a Python list, moves the last entry to the front, and reindexes the headers without requiring any prior knowledge of their names.
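A compact variant of the same reordering (a sketch, identical behaviour assumed):

# move the last column to the front in one expression
cols = df.columns.tolist()
df2 = df[cols[-1:] + cols[:-1]]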