Better way to parse multiple files and create a single dataframe - python

I want to:
1. Read a file into a dataframe
2. Do some data manipulation, etc.
3. Copy one column from the dataframe
4. Append that column to a second dataframe
5. Repeat 1-4 until all files are read
My implementation is:
all_data = [[]]  # list to store each set of values
for i in file_list:
    filepath = path + i
    df = pd.read_csv(filepath, sep='\t', header=None, names=colsList)
    # various data manipulation, melt, etc, etc, etc.
    all_data.append(df['value'])
df_all = pd.DataFrame(all_data)
df_all = df_all.T  # transpose
df_all.set_axis(name_list, axis=1, inplace=True)  # fix the column names
How could this have been better implemented?
Problems:
- the data in the python list is transposed (appended by rows, not columns)
- I couldn't find a way to append by columns or transpose the list (either with a plain Python list or with pandas) that would work without an error :(
Thanks in advance...
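For what it's worth, steps 1-5 can be done without any transpose by collecting each 'value' Series in a plain list and concatenating along axis=1. A minimal sketch, with in-memory files and made-up names standing in for the real path + file_list:

```python
import io

import pandas as pd

# in-memory stand-ins for the question's tab-separated files (hypothetical data)
file_data = {
    'a.tsv': '1\t10\n2\t20\n3\t30',
    'b.tsv': '4\t40\n5\t50\n6\t60',
}
file_list = ['a.tsv', 'b.tsv']
colsList = ['value', 'other']
name_list = ['col_a', 'col_b']

columns = []                               # one Series per file
for name in file_list:
    df = pd.read_csv(io.StringIO(file_data[name]),
                     sep='\t', header=None, names=colsList)
    # ...melt / other data manipulation would go here...
    columns.append(df['value'])

df_all = pd.concat(columns, axis=1)        # columns side by side, no transpose
df_all.columns = name_list
print(df_all)
```

Since all the Series share the same default integer index, pd.concat lines them up row-for-row, which is exactly the "append by columns" the question asks for.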

If you keep the data in a dictionary, then you get columns. But every column needs a unique name, e.g. col1, col2, etc.
import pandas as pd
all_data = {}
all_data['col1'] = [1,2,3]
all_data['col2'] = [4,5,6]
all_data['col3'] = [7,8,9]
new_df = pd.DataFrame(all_data)
print(new_df)
Result:
   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6     9
The same with a for-loop.
I use io.StringIO only to simulate files in memory; in your code you would pass the file path directly.
import pandas as pd
import io
file_data = {
    'file1.csv': '1\t101\n2\t102\n3\t103',
    'file2.csv': '4\t201\n5\t202\n6\t202',
    'file3.csv': '7\t301\n8\t301\n9\t201',
}
file_list = [
    'file1.csv',
    'file2.csv',
    'file3.csv',
]
# ---
all_data = {}
for number, i in enumerate(file_list, 1):
    df = pd.read_csv(io.StringIO(file_data[i]), sep='\t', header=None, names=['value', 'other'])
    all_data[f'col{number}'] = df['value']
new_df = pd.DataFrame(all_data)
print(new_df)
You can also assign a new column directly:
new_df[f'column1'] = old_df['value']
import pandas as pd
import io
file_data = {
    'file1.csv': '1\t101\n2\t102\n3\t103',
    'file2.csv': '4\t201\n5\t202\n6\t202',
    'file3.csv': '7\t301\n8\t301\n9\t201',
}
file_list = [
    'file1.csv',
    'file2.csv',
    'file3.csv',
]
# ---
new_df = pd.DataFrame()
for number, i in enumerate(file_list, 1):
    df = pd.read_csv(io.StringIO(file_data[i]), sep='\t', header=None, names=['value', 'other'])
    new_df[f'col{number}'] = df['value']
print(new_df)
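If the col1/col2 naming is not important, the same result can be had in one pd.concat call, labelling each column with its source filename via keys= (a sketch reusing the simulated files from above):

```python
import io

import pandas as pd

# same in-memory stand-ins for the files as above
file_data = {
    'file1.csv': '1\t101\n2\t102\n3\t103',
    'file2.csv': '4\t201\n5\t202\n6\t202',
    'file3.csv': '7\t301\n8\t301\n9\t201',
}
file_list = ['file1.csv', 'file2.csv', 'file3.csv']

# collect the 'value' Series from each simulated file
series = [
    pd.read_csv(io.StringIO(file_data[f]), sep='\t',
                header=None, names=['value', 'other'])['value']
    for f in file_list
]

# keys= labels each resulting column with its source filename
new_df = pd.concat(series, axis=1, keys=file_list)
print(new_df)
```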

Related

issue with index on a saved DataFrame imported with to_csv() function

Hi, I have created a DataFrame with pandas from a csv in this way:
elementi = pd.read_csv('elementi.csv')
df = pd.DataFrame(elementi)
lst = []
lst2 = []
for x in df['elementi']:
    a = x.split(";")
    lst.append(a[0])
    lst2.append(a[1])
ipo_oso = np.random.randint(0, 3, 76)
oso = np.random.randint(3, 5, 76)
ico = np.random.randint(5, 6, 76)
per_ico = np.random.randint(6, 7, 76)
df = pd.DataFrame(lst, index=lst2, columns=['elementi'])
# drop the elements I don't use in the periodic table
df = df.drop(df[103:117].index)
df = df.drop(df[90:104].index)
df = df.drop(df[58:72].index)
df.head()
df['ipo_oso'] = ipo_oso
df['oso'] = oso
df['ico'] = ico
df['per_ico'] = per_ico
df.to_csv('period_table')
df.head()
and it looks like this.
When I save this table with to_csv() and import it in another project with read_csv(), the index of the table is treated as a column, even though it is the index:
e = pd.read_csv('period_table')
e.head()
or
e = pd.read_csv('period_table')
df = pd.DataFrame(e)
df.head()
How can I fix that? :)
Just use index_col=0 as a parameter of read_csv:
df = pd.read_csv('period_table', index_col=0)
df.head()
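Both round-trip options can be checked with a small throwaway frame (a sketch; in-memory buffers stand in for the real period_table file):

```python
import io

import pandas as pd

# tiny stand-in for the periodic-table frame (hypothetical data)
df = pd.DataFrame({'elementi': ['H', 'He', 'Li']}, index=['h', 'he', 'li'])

# Option 1: write the index to the file, restore it on read with index_col=0
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)

# Option 2: never write the index in the first place
buf2 = io.StringIO()
df.to_csv(buf2, index=False)
print(restored)
```

With index_col=0 the first, unnamed CSV column becomes the index again; with index=False at save time there is nothing to restore.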

Create nested json lines from pipe delimited flat file using python

I have a pipe-delimited text file, shown below. In that file, the same ID, CODE and NUM combination can have different INC and INC_DESC values.
ID|CODE|NUM|INC|INC_DESC
"F1"|"W1"|1|1001|"INC1001"
"F1"|"W1"|1|1002|"INC1002"
"F1"|"W1"|1|1003|"INC1003"
"F2"|"W1"|1|1002|"INC1003"
"F2"|"W1"|1|1003|"INC1004"
"F2"|"W2"|1|1003|"INC1003"
We want to create JSON like below, where the different INC and INC_DESC values become an array for the same combination of ID, CODE and NUM:
{"ID":"F1","CODE":"W1","NUM":1,"INC_DTL":[{"INC":1001, "INC_DESC":"INC1001"},{"INC":1002, "INC_DESC":"INC1002"},{"INC":1003, "INC_DESC":"INC1003"}]}
{"ID":"F2","CODE":"W1","NUM":1,"INC_DTL":[{"INC":1002, "INC_DESC":"INC1002"},{"INC":1003, "INC_DESC":"INC1003"}]}
{"ID":"F2","CODE":"W2","NUM":1,"INC_DTL":[{"INC":1003, "INC_DESC":"INC1003"}]}
I tried the below, but it does not generate the nesting I want:
import pandas as pd

Input_File = r'V:\input.dat'
df = pd.read_csv(Input_File, sep='|')
json_output = r'V:\outfile.json'
output = df.to_json(json_output, orient='records')
import pandas as pd

# agg function: wrap each group's values in a list
def agg_that(x):
    return [x]

Input_File = r'V:\input.dat'
df = pd.read_csv(Input_File, sep='|')

# group by the key columns
df = df.groupby(['ID', 'CODE', 'NUM']).agg(agg_that).reset_index()

# create the nested column
df['INC_DTL'] = df.apply(
    lambda x: [{'INC': inc, 'INC_DESC': dsc}
               for inc, dsc in zip(x['INC'][0], x['INC_DESC'][0])],
    axis=1)

# drop the old columns
df.drop(['INC', 'INC_DESC'], axis=1, inplace=True)

json_output = r'V:\outfile.json'
output = df.to_json(json_output, orient='records', lines=True)
OUTPUT:
{"ID":"F1","CODE":"W1","NUM":1,"INC_DTL":[{"INC":1001,"INC_DESC":"INC1001"},{"INC":1002,"INC_DESC":"INC1002"},{"INC":1003,"INC_DESC":"INC1003"}]}
{"ID":"F2","CODE":"W1","NUM":1,"INC_DTL":[{"INC":1002,"INC_DESC":"INC1003"},{"INC":1003,"INC_DESC":"INC1004"}]}
{"ID":"F2","CODE":"W2","NUM":1,"INC_DTL":[{"INC":1003,"INC_DESC":"INC1003"}]}

Processing values of corresponding fields in multiple CSV files using Python

I have been trying to find the min and max of corresponding values in multiple CSV files.
Though, in reality, I have several files with multiple columns in each of them, here are three simple sample files:
file1.csv:
a1,a2,a3
b1,b2,b3
c1,c2,c3
file2.csv:
p1,p2,p3
q1,q2,q3
r1,r2,r3
file3.csv:
x1,x2,x3
y1,y2,y3
z1,z2,z3
Using Python, how can I create lists of corresponding values, for example L1 = [a1,p1,x1,...], L2 = [a2,p2,x2,...] and so on?
Is there an easy way to create a CSV file containing the min and max of the corresponding values of the above input files:
min(a1,p1,x1)-max(a1,p1,x1), min(a2,p2,x2)-max(a2,p2,x2), min(a3,p3,x3)-max(a3,p3,x3)
min(b1,q1,y1)-max(b1,q1,y1), min(b2,q2,y2)-max(b2,q2,y2), min(b3,q3,y3)-max(b3,q3,y3)
min(c1,r1,z1)-max(c1,r1,z1), min(c2,r2,z2)-max(c2,r2,z2), min(c3,r3,z3)-max(c3,r3,z3)
Any help would be greatly appreciated
If you want to do it in pure Python, you could do something like this:
import csv
from contextlib import ExitStack

in_filenames = ["file1.csv", "file2.csv", "file3.csv"]
out_filename = "file.csv"

with ExitStack() as stack:
    readers = [
        csv.reader(stack.enter_context(open(filename, "r", newline="")))
        for filename in in_filenames
    ]
    # newline="" is the documented way to open files for the csv module
    writer = csv.writer(stack.enter_context(open(out_filename, "w", newline="")))
    while True:
        try:
            rows = zip(*[next(reader) for reader in readers])
        except StopIteration:
            break
        else:
            out_row = []
            for numbers in rows:
                numbers = [float(number) for number in numbers]
                out_row.append(min(numbers) - max(numbers))
            writer.writerow(out_row)
If you can use Pandas it's a bit easier:
import pandas as pd

in_filenames = ["file1.csv", "file2.csv", "file3.csv"]
out_filename = "file.csv"

dfs = [
    pd.read_csv(filename, header=None)
    for filename in in_filenames
]
dfs = [
    pd.concat((df[col] for df in dfs), axis=1)
    for col in dfs[0].columns
]
df = pd.concat(
    (df.min(axis=1) - df.max(axis=1) for df in dfs),
    axis=1
)
df.to_csv(out_filename, index=False, header=False)
(There's probably an easier way, I just don't see it yet.)
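One shorter variant, assuming every file has exactly the same shape: stack the frames into a 3-D NumPy array and reduce along the file axis. A sketch with three tiny in-memory files standing in for file1.csv..file3.csv:

```python
import io

import numpy as np
import pandas as pd

# three in-memory stand-ins for the real csv files (hypothetical data)
files = [io.StringIO(s) for s in
         ('1,2,3\n4,5,6\n7,8,9',
          '2,0,3\n1,5,9\n7,7,7',
          '0,2,6\n4,4,4\n9,8,9')]

arrays = [pd.read_csv(f, header=None).to_numpy() for f in files]
stacked = np.dstack(arrays)            # shape: rows x cols x n_files

# min - max across the files for every cell (always <= 0)
result = stacked.min(axis=2) - stacked.max(axis=2)
print(result)
```

The result can then be written with pd.DataFrame(result).to_csv(..., index=False, header=False), as in the pandas version above.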

Need to pick 'second column' from multiple csv files and save all 'second columns' in one csv file

So I have 366 CSV files and I want to copy their second columns and write them into a new CSV file. I need code for this job. I tried some code available here, but nothing works. Please help.
Assuming all the 2nd columns are the same length, you can simply loop through all the files: read each one, save its 2nd column, and construct a new df along the way.
filenames = ['test.csv', ....]
new_df = pd.DataFrame()
for filename in filenames:
    df = pd.read_csv(filename)
    second_column = df.iloc[:, 1]
    new_df[f'SECOND_COLUMN_{filename.upper()}'] = second_column
    del df
new_df.to_csv('new_csv.csv', index=False)
Or, collecting the filenames with glob:
import glob

filenames = glob.glob(r'D:/CSV_FOLDER' + "/*.csv")
new_df = pd.DataFrame()
for filename in filenames:
    df = pd.read_csv(filename)
    second_column = df.iloc[:, 1]
    new_df[f'SECOND_COLUMN_{filename.upper()}'] = second_column
    del df
new_df.to_csv('new_csv.csv', index=False)
This can be accomplished with glob and pandas:
import glob
import pandas as pd

mylist = [f for f in glob.glob("*.csv")]

df = pd.read_csv(mylist[0])       # create the dataframe from the first csv
df = pd.DataFrame(df.iloc[:, 1])  # only keep the 2nd column
for x in mylist[1:]:              # loop through the rest of the csv files doing the same
    t = pd.read_csv(x)
    colName = pd.DataFrame(t.iloc[:, 1]).columns
    df[colName] = pd.DataFrame(t.iloc[:, 1])
df.to_csv('output.csv', index=False)
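The loop can also collapse into a single pd.concat over the second columns, renaming each Series after its source file. A sketch with two in-memory files standing in for the 366 real ones:

```python
import io

import pandas as pd

# hypothetical stand-ins for the real csv files on disk
file_data = {
    'jan.csv': 'a,b\n1,10\n2,20',
    'feb.csv': 'a,b\n3,30\n4,40',
}
mylist = list(file_data)

# second column of each file, with the filename as the column header
second_cols = pd.concat(
    [pd.read_csv(io.StringIO(file_data[f])).iloc[:, 1].rename(f)
     for f in mylist],
    axis=1,
)
print(second_cols)
```

From here, second_cols.to_csv('output.csv', index=False) writes the combined file, mirroring the loops above.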

Create a new dataframe out of dozens of df.sum() series

I have several pandas DataFrames of the same format, with five columns.
I would like to sum the values of each one of these dataframes using df.sum(). This creates a Series for each DataFrame, still with the 5 columns.
My problem is how to take these Series and create another DataFrame, with one column holding the filename and the other columns holding the five sums from df.sum().
import pandas as pd
import glob
batch_of_dataframes = glob.glob("*.txt")
newdf = []
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()
    newdf.append(df)
newdf = pd.concat(newdf, ignore_index=True)
This approach doesn't work, unfortunately: df['filename'] = str(filename) throws a TypeError, and creating the new dataframe newdf doesn't work correctly.
How would one do this correctly?
How do you take a number of pandas.Series objects and create a DataFrame?
Try in this order:
1. Create an empty list, say list_of_series.
2. For every file:
   - load it into a data frame, then save the sum in a series s
   - add an element to s: s['filename'] = your_filename
   - append s to list_of_series
3. Finally, concatenate (and transpose if needed):
final_df = pd.concat(list_of_series, axis=1).T
Code
Preparation:
import glob

import numpy as np
import pandas as pd

l_df = [pd.DataFrame(np.random.rand(3, 5), columns=list("ABCDE")) for _ in range(5)]
for i, df in enumerate(l_df):
    df.to_csv(str(i) + '.txt', index=False)
Files *.txt are comma separated and contain headers.
! cat 1.txt
A,B,C,D,E
0.18021800981245173,0.29919271590063656,0.09527248614484807,0.9672038093199938,0.07655003742768962
0.35422759068109766,0.04184770882952815,0.682902924462214,0.9400817219440063,0.8825581077493059
0.3762875793116358,0.4745731412494566,0.6545473610147845,0.7479829630649761,0.15641907539706779
And, indeed, the rest is quite similar to what you did (I append the file name to the series, not to the data frame; otherwise it would get concatenated several times by sum()):
files = glob.glob('*.txt')
print(files)
['3.txt', '0.txt', '4.txt', '2.txt', '1.txt']

list_of_series = []
for f in files:
    df = pd.read_csv(f)
    s = df.sum()
    s['filename'] = f
    list_of_series.append(s)
final_df = pd.concat(list_of_series, axis=1).T
print(final_df)
          A         B          C        D        E filename
0    1.0675   2.20957    1.65058  1.80515  2.22058    3.txt
1  0.642805   1.36248  0.0237625  1.87767  1.63317    0.txt
2   1.68678   1.26363   0.835245  2.05305  1.01829    4.txt
3   1.22748   2.09256   0.785089  1.87852  2.05043    2.txt
4  0.910733  0.815614    1.43272  2.65527  1.11553    1.txt
To answer this specific question:
"@ThomasTu How do I go from a list of Series with 'Filename' as a column to a dataframe? I think that's the problem---I don't understand this"
It's essentially what you have now, but instead of appending to an empty list, you append to an empty dataframe, reassigning newdf on each iteration. (Note: DataFrame.append returns a new frame rather than modifying in place, and it was deprecated and removed in pandas 2.0; with current pandas, collect the Series into a list and use pd.concat as above.)
import pandas as pd
import glob

batch_of_dataframes = glob.glob("*.txt")

newdf = pd.DataFrame()
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()
    newdf = newdf.append(df, ignore_index=True)
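Since DataFrame.append is gone in pandas 2.x, the list-plus-concat form of the same idea is the durable one. A runnable sketch, with two tiny inline frames (made-up names and data) standing in for the files read from disk:

```python
import pandas as pd

# two tiny frames standing in for files read with pd.read_csv (hypothetical data)
frames = {
    'a.txt': pd.DataFrame({'A': [1, 2], 'B': [3, 4]}),
    'b.txt': pd.DataFrame({'A': [5, 5], 'B': [1, 1]}),
}

rows = []
for filename, df in frames.items():
    s = df.sum()              # numeric column sums as a Series
    s['filename'] = filename  # label the summed row with its source file
    rows.append(s)

# one summed row per file, filename as the last column
final_df = pd.concat(rows, axis=1).T.reset_index(drop=True)
print(final_df)
```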
