I have a strange problem. In my folder I have .dat files with CO2 values from a CO2 sensor in the laboratory: data from experiments 4, 5, 6, 7 and 8, named CO2_4.dat, CO2_5.dat, CO2_6.dat, CO2_7.dat and CO2_8.dat.
I know how to read them manually. For example, reading CO2_4 works like this:
dfCO2_4_manual = pd.read_csv(r'C:\data\CO2\co2_4.dat', sep=";", encoding= 'unicode_escape', header = 0, skiprows=[0], usecols=[0,1,2,4], names =["ts","t","co2_4", "p"])
display(dfCO2_4_manual)
which gives me a dataframe with the correct values: one value every minute.
But if I want to loop over my folder and read them all with this technique (which works for other CSV files from the laboratory), saving the dataframes in a dictionary:
exp_list =[4,5,6,7,8] # list with number of each experiment
path_CO2 = r'C:\data\CO2'
CO2_files = glob.glob(os.path.join(path_CO2, "*.dat"))
CO2_dict = {}
for f, i in zip(offline_files, exp_list):
    CO2_dict["CO2_{0}".format(i)] = pd.read_csv(f, sep=";", encoding='unicode_escape', header=0, skiprows=[0], usecols=[0,1,2,4], names=["ts","t","CO2_{0}".format(i), "p"])
display(CO2_dict["CO2_4"])
gives me a dataframe with many skipped and completely wrong values.
If I open the CO2_4.dat file with a text editor, it looks like this:
Does someone know what is happening?
It's hard to say exactly what's wrong without access to your files. However, is this line
for f, i in zip(offline_files, exp_list):
correct? Where is offline_files defined? It's not in the code you have provided. Also, do you want to analyze each df separately? Is that why you are storing them in a dictionary?
As an alternative, you can store each df in a list and concatenate them. You can then group the result and apply your analyses that way.
df_hold_list = []
for f, i in zip(CO2_files, exp_list):  # changed file list name; please verify
    df = pd.read_csv(f, sep=";", encoding='unicode_escape', header=0, skiprows=[0], usecols=[0,1,2,4], names=["ts","t","CO2_{0}".format(i), "p"])
    df['file'] = 'CO2_{0}'.format(i)  # add column for sorting/grouping
    df_hold_list.append(df)
df_new = pd.concat(df_hold_list, axis=0)  # check the axis, should be 0 or 1
I can't test the code, but it should work. Let me know if it doesn't.
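Once everything is in one frame, the file column lets you group per experiment. A minimal sketch, assuming the column names from the read_csv call above (the shared "p" column is used just for illustration):

# hypothetical example: one aggregate value per experiment file
print(df_new.groupby('file')['p'].mean())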
This is driving me nuts! I have several DataFrames that I am trying to concatenate with pandas. The index is the filename. When I use df.to_csv for individual dataframes, I can see the index column (filename) along with the column of interest. When I concatenate along the filename axis, I only get the column of interest and row numbers, no filename.
Here is the code I am using as-is. It works as I expect up until the "all_filenames" line.
for filename in os.listdir(directory):
    if filename.endswith("log.csv"):
        df = pd.read_fwf(filename, skiprows=186, nrows=1, names=["Attribute"])
        df['System_Library_Name'] = [x.split('/')[6] for x in df['Attribute']]
        df2 = pd.concat([df for filename in os.listdir(directory)], keys=[filename])
        df2.to_csv(filename+"log_info.csv", index=filename)
all_filenames = glob.glob(os.path.join(directory,'*log_info.csv'))
cat_log = pd.concat([pd.read_csv(f) for f in all_filenames ])
cat_log2= cat_log[['System_Library_Name']]
cat_log2.to_excel("log.xlsx", index=filename)
I have tried adding keys=filename to the third-to-last line and giving the index a name with df.index.name=.
I have used similar code before and had it work fine; however, this time I am only using one column from a larger original input file, if that makes a difference.
Any advice is greatly appreciated!
df = pd.concat(
    # this is just reading one value from each file, yes?
    [pd.read_fwf(filename, skiprows=186, nrows=1, names=["Attribute"])
       .set_index(pd.Index([filename]))
       .applymap(lambda x: x.split('/')[6])
       .rename(columns={'Attribute': 'System_Library_Name'})
     for filename in glob.glob(os.path.join(directory, '*log.csv'))]
)
df.to_excel("log_info.xlsx")  # note: the writer method is to_excel; pandas has no to_xlsx
I have two CSV files, and I want to combine them into one. Say the files are A.csv and B.csv; I already know there are some conflicts between them. For example, both have two columns, ID and name. In A.csv, ID "12345" has the name "Jack"; in B.csv, ID "12345" has the name "Tom". So the same ID has different names. I want to keep ID "12345", choose the name from A.csv, and abandon the name from B.csv. How can I do that?
Here is some code I have tried. It can combine the two CSV files but cannot deal with the conflicts; more precisely, it cannot prefer the value from A.csv:
import pandas as pd
import glob
def merge(csv_list, outputfile):
    for input_file in csv_list:
        f = open(input_file, 'r', encoding='utf-8')
        data = pd.read_csv(f, error_bad_lines=False)
        data.to_csv(outputfile, mode='a', index=False)
    print('Combine Completed')

def distinct(file):
    df = pd.read_csv(file, header=None)
    datalist = df.drop_duplicates()
    datalist.to_csv('result_new_month01.csv', index=False, header=False)
    print('Distinct Completed')

if __name__ == '__main__':
    csv_list = glob.glob('*.csv')
    output_csv_path = 'result.csv'
    print(csv_list)
    merge(csv_list, output_csv_path)
    distinct(output_csv_path)
P.S. English is not my native language. Please excuse my syntax errors.
Will put down my comments here as an answer:
The problem with your merge function is that you're reading each file and appending it to the same result.csv without any check for duplicate names. In order to check for duplicates, the rows need to be in the same dataframe. Also, your code combines every CSV file in the folder, not necessarily just A.csv and B.csv. So when you say "I want to choose name from A.csv, and abandon name from B.csv", it looks like you really mean "keep the first one".
Anyway, fix your merge() function to keep reading files into a list of dataframes - with A.csv being first. And then use #gold_cy's answer to concatenate the dataframes keeping only the first occurrence.
Or, in your distinct() function, instead of datalist = df.drop_duplicates(), use datalist = df.drop_duplicates("ID", keep="first").reset_index(drop=True). But this is better done after the read loop: instead of writing out a CSV full of duplicates, first drop the dupes and then write out the CSV.
So here's the change using the first method:
import pandas as pd
import glob
def merge_distinct(csv_list, outputfile):
    all_frames = []  # list of dataframes
    for input_file in csv_list:
        # skip your file-open line and pass those params to pd.read_csv
        data = pd.read_csv(input_file, encoding='utf-8', error_bad_lines=False)
        all_frames.append(data)  # append to list of dataframes
    print('Combine Completed')
    final = pd.concat(all_frames).drop_duplicates("ID", keep="first").reset_index(drop=True)
    final.to_csv(outputfile, index=False, header=False)  # write to the passed-in path
    print('Distinct Completed')

if __name__ == '__main__':
    csv_list = sorted(glob.glob('*.csv'))  # sort the list so A.csv comes first
    output_csv_path = 'result.csv'
    print(csv_list)
    merge_distinct(csv_list, output_csv_path)
Notes:
Instead of doing f = open(...), just pass those arguments to pd.read_csv().
Why are you writing out the final csv with header=False? The header is useful to have.
glob.glob() doesn't guarantee any ordering (it depends on the filesystem), so I've used sorted() above.
Filesystem ordering is also not the same as sorted(), which sorts essentially by ASCII/Unicode code point.
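For completeness, a sketch of the second method: fix distinct() as described above, assuming the combined file is read with its header so the "ID" column name is available (your original used header=None):

def distinct(file):
    df = pd.read_csv(file)  # keep the header so drop_duplicates can find the "ID" column
    datalist = df.drop_duplicates("ID", keep="first").reset_index(drop=True)
    datalist.to_csv('result_new_month01.csv', index=False)
    print('Distinct Completed')

Note that with your append-mode merge(), result.csv would also contain one header row per input file, which would need cleaning up; that's another reason to prefer the first method.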
If you want to keep one DataFrame's values over the other's, concatenate them and keep only the first occurrence of each duplicated ID. This means the preferred values should come first in the sequence you pass to concat, as shown below.
df = pd.DataFrame({"ID": ["12345", "4567", "897"], "name": ["Jack", "Tom", "Frank"]})
df1 = pd.DataFrame({"ID": ["12345", "333", "897"], "name": ["Tom", "Sam", "Rob"]})
pd.concat([df, df1]).drop_duplicates("ID", keep="first").reset_index(drop=True)
      ID   name
0  12345   Jack
1   4567    Tom
2    897  Frank
3    333    Sam
I have the following code:
import pandas as pd
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
dataset2 = pd.read_csv(file_path, header=None, dtype=str)
v = dataset2.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
dataset1 = pd.DataFrame(f)
df = dataset1.astype('str')
dataset = df.values.tolist()
print (type (dataset))
print (type (dataset[1]))
print (type (dataset[1][1]))
The goal is to map each distinct value in the dataset to an integer code and afterwards to transform the result into a list of lists where each element is a string.
The above code works great. However, when I change the dataset to:
file_path ='https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/vowel/vowel-context.data'
I get an error. How can it work for this dataset as well?
You need to understand the data you're working with. A quick print call would have shown you that the delimiter in this file is different: whitespace rather than commas.
Furthermore, it appears to be numeric data; you don't need an str conversion anymore.
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/vowel/vowel-context.data'
t = pd.read_csv(file_path, header=None, delim_whitespace=True)
v = t.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
df = pd.DataFrame(f)
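If you still need the list of lists of strings from your original code, the remaining steps carry over unchanged:

df = df.astype('str')         # stringify the integer codes
dataset = df.values.tolist()  # list of lists, one inner list per row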
If you want pandas to guess the delimiter, you can use sep=None:
t = pd.read_csv(file_path, header=None, sep=None, engine='python')  # sep=None needs the python engine
I don't recommend this because it is very easy for pandas to make mistakes when loading your data with an inferred delimiter.
Python newbie here, please be gentle. I have data in two "middle sections" of multiple Excel spreadsheets that I would like to isolate into one pandas dataframe. Below is a link to a data screenshot.
Within each file, my headers are in Row 4 with data in Rows 5-15, Columns B:O. The headers and data then continue with headers on Row 21, data in Rows 22-30, Columns B:L. I would like to move the headers and data from the second set and append them to the end of the first set of data.
This code captures the header from Row 4 and the data in Columns B:O, but it captures all rows under the header, including the second header and the second set of data. How do I move this second set of data and append it after the first set?
path =r'C:\Users\sarah\Desktop\Original'
allFiles = glob.glob(path + "/*.xls")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_excel(file_, sheetname="Data1", parse_cols="B:O", index_col=None, header=3, skip_rows=3)
    list_.append(df)
frame = pd.concat(list_)
Screenshot of my data
If all of your Excel files have the same number of rows and this is a one-time operation, you could simply hard-code those numbers in your read_excel. If not, it will be a little tricky, but you pretty much follow the same procedure:
for file_ in allFiles:
    top = pd.read_excel(file_, sheetname="Data1", parse_cols="B:O", index_col=None,
                        header=4, skiprows=3, nrows=14)  # note the nrows kwarg
    bot = pd.read_excel(file_, sheetname="Data1", parse_cols="B:L", index_col=None,
                        header=21, skiprows=20, nrows=14)
    list_.append(top.join(bot, lsuffix='_t', rsuffix='_b'))
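If you would rather have the second block appended below the first, as the question describes, a hypothetical variant of the last line stacks the two frames vertically instead (only columns shared by name line up; the remaining columns get NaN in the other block's rows):

    list_.append(pd.concat([top, bot], axis=0, ignore_index=True))  # bot's rows follow top's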
You can do it this way:
df1 = pd.read_excel(file_, sheetname="Data1", parse_cols="B:O", index_col=None, header=3, skiprows=3)
df2 = pd.read_excel(file_, sheetname="Data1", parse_cols="B:L", index_col=None, header=20, skiprows=20)
# pay attention to `axis=1`
df = pd.concat([df1, df2], axis=1)
I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
I'm not sure if I am reading in the Excel file correctly. I also do not know how to write the extracted data to a new file once I get the code working.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv as whitespace-delimited:
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
it looks like this:
>>> data_file
Values
ID
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and you use its index with loc to get the matching entries from your first dataframe. Drop the missing values with dropna():
res = data_file.loc[ID_file.index].dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
You can do it using a simple dictionary in Python. Build a dictionary from File 1 and read the IDs from File 2. Each ID from File 2 can be checked against the dictionary, and only the matching ones written to your output file. Something like this could work:
with open('data.csv','r') as f:
    lines = f.readlines()
# Skip the CSV header
lines = lines[1:]
table = {l.split()[0]: l.split()[1] for l in lines if len(l.strip()) != 0}

with open('id.csv','r') as f:
    lines = f.readlines()
# Skip the CSV header
lines = lines[1:]
matchedIDs = [(l.strip(), table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them in any format you like in a file.
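For example, to write the matched pairs out as a two-column CSV (a minimal sketch using only the matchedIDs list built above):

with open('new_data.csv', 'w') as out:
    out.write('ID,Values\n')  # header row
    for id_, value in matchedIDs:
        out.write('{},{}\n'.format(id_, value))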
I'm also new to Python programming, so the code I used below might not be the most efficient. The situation I assumed is that we want to find IDs present in both data.csv and id.csv; there might be some IDs in data.csv that are not in id.csv, and vice versa.
import pandas as pd
data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')
data.ID = data['ID']
id2.ID = id2['IDs']
d = []
for row in data.ID:
    d.append(row)
f = []
for row in id2.ID:
    f.append(row)
g = []
for i in d:
    if i in f:
        g.append(i)
data = pd.read_csv('data.csv',index_col='ID')
new_data = data.loc[g,:]
new_data.to_csv('new_data.csv')
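The three loops above could be collapsed with pandas' isin; a hypothetical shortening of the same matching logic, using the data and id2 frames from the first read_csv calls:

# keep only the IDs from data.csv that also appear in id.csv
g = data.loc[data['ID'].isin(id2['IDs']), 'ID'].tolist()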
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')