I have a csv with the following entries:
apple,orange,bannana,grape
10,5,6,4
four,seven,eight,nine
yes,yes,no,yes
3,5,7,4
two,one,six,nine
no,no,no,yes
2,4,7,8
yellow,four,eight,one
no,yes,no,no
I would like to make a new csv file with the following format and so on:
apple,10,four,yes
orange,5,seven,yes
bannana,6,eight,no
grape,4,nine,yes
apple,3,two,no
orange,5,one,no
bannana,7,six,no
grape,4,nine,yes
So after grape it starts at apple with the new values.
I have tried using pandas DataFrames but can't figure out how to get the data into the format I need.
You could try the following in pure Python (data.csv is the name of the input file):
import csv
from itertools import islice
with open("data.csv", "r", newline="") as fin,\
     open("data_new.csv", "w", newline="") as fout:
    reader, writer = csv.reader(fin), csv.writer(fout)
    header = next(reader)
    # Each record spans 3 rows in the sample, which equals len(header) - 1
    length = len(header) - 1
    # Read one group of rows at a time, then transpose it
    while (rows := list(islice(reader, length))):
        writer.writerows([first, *rest] for first, rest in zip(header, zip(*rows)))
Or with Pandas:
import pandas as pd
df = pd.read_csv("data.csv")
df = pd.concat(gdf.T for _, gdf in df.set_index(df.index % 3).groupby(df.index // 3))
df.reset_index().to_csv("data_new.csv", index=False, header=False)
Output file data_new.csv for the provided sample:
apple,10,four,yes
orange,5,seven,yes
bannana,6,eight,no
grape,4,nine,yes
apple,3,two,no
orange,5,one,no
bannana,7,six,no
grape,4,nine,yes
apple,2,yellow,no
orange,4,four,yes
bannana,7,eight,no
grape,8,one,no
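If the number of rows per record is not one less than the number of columns, the group size can be passed in explicitly. Below is a minimal sketch of the same chunk-and-transpose idea, assuming the layout of the sample (one header row followed by fixed-size groups of rows). The names `reshape_rows` and `rows_per_group` are made up for this illustration, and `io.StringIO` stands in for the real file.

```python
import csv
import io
from itertools import islice


def reshape_rows(lines, rows_per_group):
    """Yield one output row per header name for each group of input rows."""
    reader = csv.reader(lines)
    header = next(reader)
    # Pull rows_per_group rows at a time until the reader is exhausted
    while rows := list(islice(reader, rows_per_group)):
        # zip(*rows) transposes the group: one tuple per column
        for name, values in zip(header, zip(*rows)):
            yield [name, *values]


# First two records of the sample from the question
sample = io.StringIO(
    "apple,orange,bannana,grape\n"
    "10,5,6,4\n"
    "four,seven,eight,nine\n"
    "yes,yes,no,yes\n"
    "3,5,7,4\n"
    "two,one,six,nine\n"
    "no,no,no,yes\n"
)
result = list(reshape_rows(sample, rows_per_group=3))
for row in result:
    print(",".join(row))
```

With a real file, pass the object returned by `open("data.csv", newline="")` instead of `sample`.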
Hope it works for you.
df = pd.read_csv('<source file name>')
df.T.to_csv('<destination file name>')
You can transpose your dataframe in pandas as below.
pd.read_csv('file.csv', index_col=0, header=None).T
This question has already been answered:
Can pandas read a transposed CSV?
According to your new description, the problem has changed completely.
You need to split your dataframe into subsets and merge them.
import pandas as pd
# Read dataframe without header
df = pd.read_csv('your_dataframe.csv', header=None)
# Collect the transposed subsets in a list
parts = []
# Create, transpose and collect the subsets
for i in range(1, df.shape[0], 3):
    # Header row plus the next three rows (DataFrame.append was removed in pandas 2.0)
    temp = pd.concat([df.iloc[[0]], df.iloc[i:i+3]])
    temp = temp.transpose()
    temp.columns = [0, 1, 2, 3]
    parts.append(temp)
# Merge the subsets into one DataFrame
tr = pd.concat(parts)
Hello everyone, I am new to Python. I have a column in a CSV file with values like this example:
I want to split the programme column at the semicolon into two columns, for example:
program 1: H2020-EU.3.1.
program 2: H2020-EU.3.1.7.
This is what I wrote initially
import csv
import os
with open('IMI.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    with open('new_IMI.csv', 'w') as new_file:
        csv_writer = csv.writer(new_file, delimiter='\t')
        #for line in csv_reader:
        #    csv_writer.writerow(line)
Please note that after I split the column, I need to write the file out again as a CSV and save it to my computer.
Please guide me.
Using .loc to iterate through each row of a dataframe is somewhat inefficient. It is better to split the entire column at once, with expand=True so that the pieces can be assigned to the new columns. As stated, this is easy to do with pandas:
Code:
import pandas as pd
df = pd.read_csv('IMI.csv')
df[['programme1','programme2']] = df['programme'].str.split(';', expand=True)
df.drop(['programme'], axis=1, inplace=True)
df.to_csv('IMI.csv', index=False)
Example of output:
Before:
print(df)
       id     acronym  status                    programme           topics
0  945358  BIGPICTURE  SIGNED  H2020-EU.3.1.;H2020-EU3.1.7  IMI2-2019-18-01
1  821362      EBiSC2  SIGNED  H2020-EU.3.1.;H2020-EU3.1.7  IMI2-2017-13-06
2  116026     HARMONY  SIGNED                H2020-EU.3.1.  IMI2-2015-06-04
After:
print(df)
       id     acronym  status           topics     programme1     programme2
0  945358  BIGPICTURE  SIGNED  IMI2-2019-18-01  H2020-EU.3.1.  H2020-EU3.1.7
1  821362      EBiSC2  SIGNED  IMI2-2017-13-06  H2020-EU.3.1.  H2020-EU3.1.7
2  116026     HARMONY  SIGNED  IMI2-2015-06-04  H2020-EU.3.1.           None
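One caveat with the two-column assignment: if any value contained more than one semicolon, expand=True would return more than two columns and the assignment would fail. A minimal sketch of one way to guard against that, using the `n=1` argument of `str.split` so that only the first semicolon is used. The rows here are made up for illustration.

```python
import pandas as pd

# Made-up rows: the first value contains two semicolons
df = pd.DataFrame({
    "id": [1, 2],
    "programme": ["H2020-EU.3.1.;H2020-EU.3.1.7.;H2020-EU.3.", "H2020-EU.3.1."],
})

# n=1 splits on the first semicolon only, so at most two columns come back
parts = df["programme"].str.split(";", n=1, expand=True)
df[["programme1", "programme2"]] = parts
df = df.drop(columns="programme")
print(df)
```

Everything after the first semicolon stays together in programme2, and rows without a semicolon get a missing value there. If no row contains a semicolon at all, the split returns a single column, so that case would need separate handling.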
You can use the pandas library instead of csv.
import pandas as pd
df = pd.read_csv('IMI.csv')
p1 = {}
p2 = {}
for i in range(len(df)):
    # Only rows that contain a semicolon are split
    if ';' in df['programme'].loc[i]:
        p1[df['id'].loc[i]] = df['programme'].loc[i].split(';')[0]
        p2[df['id'].loc[i]] = df['programme'].loc[i].split(';')[1]
df['programme1'] = df['id'].map(p1)
df['programme2'] = df['id'].map(p2)
If you want to delete the programme column:
df = df.drop('programme', axis=1)  # drop returns a new frame, so assign it back
To save the new CSV file:
df.to_csv('new_file.csv', index=False)  # to_csv has no inplace parameter
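Since the question started out with the csv module, the same split can also be done without pandas. This is only a sketch under assumptions: the sample rows and the column names other than programme are made up, and `io.StringIO` objects stand in for the input and output files.

```python
import csv
import io

# Made-up sample standing in for IMI.csv
source = io.StringIO(
    "id,acronym,programme\n"
    "945358,BIGPICTURE,H2020-EU.3.1.;H2020-EU.3.1.7.\n"
    "116026,HARMONY,H2020-EU.3.1.\n"
)
target = io.StringIO()

reader = csv.DictReader(source)
# Replace the programme column with the two new columns
fieldnames = [f for f in reader.fieldnames if f != "programme"]
fieldnames += ["programme1", "programme2"]
writer = csv.DictWriter(target, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
for row in reader:
    # partition splits on the first ';' and returns '' when there is none
    first, _, second = row.pop("programme").partition(";")
    row["programme1"], row["programme2"] = first, second
    writer.writerow(row)

print(target.getvalue())
```

For real files, `source` would be `open('IMI.csv', newline='')` and `target` would be `open('new_IMI.csv', 'w', newline='')`.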
I have the following code to import some data from a website, I want to convert my data variable into a dataframe.
I've tried pd.DataFrame and pd.read_csv(io.StringIO(data), sep=";"), but I always get an error.
import requests
import io
# load file
data = requests.get('https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT').content
# decode data
data = data.decode('latin-1')
# skip first 2 rows
data = data.split('\r\n')[2::]
del data[1]
# trying to fix csv structure
lines = []
lines_2 = []
for line in data:
    line = ';'.join(line.split(';'))
    if len(line) > 0 and line[0].isdigit():
        lines.append(line)
        lines_2.append(line)
    else:
        if len(lines) > 0:
            lines_2.append(lines_2[-1] + line)
            lines_2.remove(lines_2[-2])
        else:
            lines.append(line)
data = '\r\n'.join(lines_2)
print(data)
The expected output should look like this:
         date   1   2
0  29/08/2020  HI  RE  ....
1  30/08/2020  HI  RE  ....
2  31/08/2020  HI  RE  ...
A few rows need to be appended to the previous one (the main rows are the ones that start with a date).
Prayson's answer is correct, but the skiprows parameter should also be used (otherwise the metadata is interpreted as column names).
import pandas as pd
df = pd.read_csv(
    "https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT",
    sep=";",
    skiprows=2,
    encoding='latin-1',
)
print(df)
You can read text/csv data directly from a URL with pandas:
import pandas as pd
URI = 'https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT'
df = pd.read_csv(URI, sep=';', encoding='latin1')
print(df)
pandas will do the downloading for you. So no need for requests or io.StringIO.
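The question also mentions rows that have to be appended to the previous one. If that is still needed after downloading, one option is to merge the continuation lines before handing the text to pandas. The sketch below uses made-up text, not the real file, and assumes that every record starts with a digit (the date) and that every other line belongs to the record before it.

```python
import io
import pandas as pd

# Made-up text in the spirit of the question: two metadata lines,
# a header, then records whose continuation lines do not start with a digit
raw = (
    "metadata line 1\n"
    "metadata line 2\n"
    "date;1;2;3\n"
    "29/08/2020;HI;RE;\n"
    "NU\n"
    "30/08/2020;HI;RE;CC\n"
)

lines = raw.splitlines()[2:]  # skip the two metadata lines
header, body = lines[0], lines[1:]

merged = []
for line in body:
    if line[:1].isdigit() or not merged:
        merged.append(line)   # a new record starts with a date
    else:
        merged[-1] += line    # glue the continuation onto the previous record

df = pd.read_csv(io.StringIO("\n".join([header, *merged])), sep=";")
print(df)
```

The real file may wrap its lines differently, so the merge rule would have to be adjusted to what the data actually looks like.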
Here's a sample csv file:
out_gate,uless_col,in_gate,n_con
p,x,x,1
p,x,y,1
p,x,z,1
a_a,u,b,1
a_a,s,b,3
a_b,e,a,2
a_b,l,c,4
a_c,e,a,5
a_c,s,b,5
a_c,s,b,3
a_c,c,a,4
a_d,o,c,2
a_d,l,c,3
a_d,m,b,2
p,y,x,1
p,y,y,1
p,y,z,3
I want to remove the useless column (the 2nd column) and the useless rows (the first three and last three rows), then save the result as a new csv file. How can I also handle a csv file that has more than 10 useless columns and rows?
(Assuming the useless rows are located only at the top or the bottom, not scattered in the middle.)
(I am also assuming that every row we want to keep has a first element starting with 'a_'.)
Can I get a solution without using numpy or pandas as well? Thanks!
This assumes that you have one or more unwanted columns and that the wanted rows start with "a_".
import csv
with open('filename.csv', newline='') as infile:
    reader = csv.reader(infile)
    header = next(reader)
    data = list(reader)
useless = set(['uless_col', 'n_con']) # Let's say there are 2 useless columns
mask, new_header = zip(*[(i,name) for i,name in enumerate(header)
                         if name not in useless])
#(0,2) - column mask
#('out_gate', 'in_gate') - new column headers
new_data = [[row[i] for i in mask] for row in data] # Remove unwanted columns
new_data = [row for row in new_data if row[0].startswith("a_")] # Remove unwanted rows
with open('filename.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(new_header)
    writer.writerows(new_data)
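You can try this:
import csv
with open('filename.csv', newline='') as f:
    data = list(csv.reader(f))
# Drop the second column from the header and from every row
header = [data[0][0]]+data[0][2:]
# Skip the header row, then slice off the first three and last three rows
final_data = [[i[0]]+i[2:] for i in data[1:]][3:-3]
with open('filename.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerows([header]+final_data)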
Output:
out_gate,in_gate,n_con
a_a,b,1
a_a,b,3
a_b,a,2
a_b,c,4
a_c,a,5
a_c,b,5
a_c,b,3
a_c,a,4
a_d,c,2
a_d,c,3
a_d,b,2
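For files with many useless columns and a varying number of useless rows, the positions can be turned into parameters. A minimal pure-Python sketch, assuming the useless rows sit only at the top and bottom as the question states. `trim_csv`, `drop_columns`, `skip_top` and `skip_bottom` are names invented for this illustration, and the sample is a shortened version of the one in the question.

```python
import csv
import io


def trim_csv(lines, drop_columns, skip_top=0, skip_bottom=0):
    """Return header and rows without the named columns and edge rows."""
    reader = csv.reader(lines)
    header = next(reader)
    rows = list(reader)
    # Positions of the columns to keep, looked up by name
    keep = [i for i, name in enumerate(header) if name not in drop_columns]
    # An explicit end index also works when skip_bottom is 0
    rows = rows[skip_top:len(rows) - skip_bottom]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


sample = io.StringIO(
    "out_gate,uless_col,in_gate,n_con\n"
    "p,x,x,1\n"
    "a_a,u,b,1\n"
    "a_b,e,a,2\n"
    "p,y,z,3\n"
)
header, rows = trim_csv(sample, {"uless_col"}, skip_top=1, skip_bottom=1)
print(header)
print(rows)
```

The end index is computed from `len(rows)` because a slice such as `rows[3:-0]` is empty, so negative slicing breaks when there are no bottom rows to skip.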
The solution below uses pandas.
As the documentation for the pandas DataFrame drop function suggests, you can do the following:
import pandas as pd
df = pd.read_csv("csv_name.csv")
df = df.drop(columns=['uless_col'])  # drop returns a new frame, so assign it back
The code above drops columns; you can drop rows by index like this:
df = df.drop([0, 1])
Alternatively, don't read in the column in the first place:
df = pd.read_csv("csv_name.csv",
                 usecols=["out_gate", "in_gate", "n_con"])
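When there are many useless columns, listing the wanted ones can be tedious. `usecols` also accepts a callable, so the useless names can be excluded instead, and the rows can then be filtered on the "a_" prefix mentioned in the question. A small sketch with a shortened, in-memory version of the sample (the `useless` set is an assumption for illustration):

```python
import io
import pandas as pd

sample = io.StringIO(
    "out_gate,uless_col,in_gate,n_con\n"
    "p,x,x,1\n"
    "a_a,u,b,1\n"
    "a_b,e,a,2\n"
    "p,y,z,3\n"
)

useless = {"uless_col"}
# usecols accepts a callable that is evaluated against each column name
df = pd.read_csv(sample, usecols=lambda name: name not in useless)
# Keep only the rows whose first column starts with "a_"
df = df[df["out_gate"].str.startswith("a_")].reset_index(drop=True)
print(df)
```

The result can then be saved with `df.to_csv("new_file.csv", index=False)`.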
I'm trying to save specific columns to a csv using pandas. However, there is only one line on the output file. Is there anything wrong with my code? My desired output is to save all columns where d.count() == 1 to a csv file.
import pandas as pd
results = pd.read_csv('employee.csv', sep=';', delimiter=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')
for columns in d:
    if (d[columns]).count() > 1:
        (d[columns]).dropna(how='any').to_csv('output.csv')
An alternative might be to populate a new dataframe containing what you want to save, and then save it once.
import pandas as pd
# sep and delimiter are aliases; pass only one of them
results = pd.read_csv('employee.csv', sep=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')
keepcols = []
for columns in d:
    if (d[columns]).count() > 1:
        keepcols.append(columns)
# keepcols holds column names of the pivoted frame d, so select from d
output_df = d[keepcols]
output_df.to_csv('output.csv')
No doubt you could rationalise the above, and reduce the memory footprint by saving the output directly without first creating an object to hold it, but it helps identify what's going on in the example.
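One way to rationalise it is to select the columns with a boolean mask instead of a loop. The sketch below uses made-up data in place of employee.csv; only the Name and Job column names come from the question.

```python
import pandas as pd

# Made-up data standing in for employee.csv
results = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bob", "Cy", "Cy", "Cy"],
    "Job": ["dev", "ops", "dev", "qa", "dev", "ops"],
})
results["index"] = results.groupby("Name").cumcount()
d = results.pivot(index="index", columns="Name", values="Job")

# d.count() gives the number of non-missing values per column;
# the boolean mask keeps the columns with more than one
output_df = d.loc[:, d.count() > 1]
print(output_df)
```

Calling `output_df.to_csv('output.csv')` then writes all selected columns in a single call, so nothing gets overwritten.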
I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
Not sure if I am reading in the excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
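import pandas as pd
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
data_file = pd.read_csv('data.csv', index_col=0, sep=r'\s+')
ID_file = pd.read_excel('ID.xlsx', index_col=0)
# reindex fills IDs missing from data.csv with NaN; .loc raises KeyError for them on pandas >= 1.0
res = data_file.reindex(ID_file.index).dropna()
res.to_csv('result.csv')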
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv as whitespace-delimited (sep=r'\s+' is equivalent to delim_whitespace=True, which recent pandas versions deprecate):
data_file = pd.read_csv('data.csv', index_col=0, sep=r'\s+')
it looks like this:
>>> data_file
                          Values
ID
HOT224_1_0025m_c100047_1      16
HOT224_1_0025m_c10004_1        3
HOT224_1_0025m_c100061_1       1
HOT224_1_0025m_c10010_2        1
HOT224_1_0025m_c10020_1        1
Now, read your Excel file, using the IDs as the index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and use its index to get the matching entries from your first dataframe. On older pandas versions loc does this; since pandas 1.0, loc raises a KeyError when any label is missing, so reindex is used here. Drop the missing values with dropna():
res = data_file.reindex(ID_file.index).dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
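Another option is to leave the index alone and filter with `isin`. This is a sketch with made-up frames standing in for data.csv and ID.xlsx; only the ID, Values and IDs column names come from the question.

```python
import pandas as pd

# Made-up frames standing in for data.csv and ID.xlsx
data = pd.DataFrame({
    "ID": ["HOT224_1_0025m_c100047_1", "HOT224_1_0025m_c10004_1",
           "HOT224_1_0025m_c100061_1"],
    "Values": [16, 3, 1],
})
ids = pd.DataFrame({
    "IDs": ["HOT224_1_0025m_c100047_1", "HOT224_1_0025m_c100061_1",
            "HOT225_1_0025m_c100547_1"],
})

# isin builds a boolean mask: True where the ID also appears in the ID list
res = data[data["ID"].isin(ids["IDs"])]
print(res.to_csv(index=False))
```

Because no missing rows are introduced along the way, the Values column keeps its integer type in the output.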
You can do it using a simple dictionary in Python. Build a dictionary from File 1 and read the IDs from File 2. Each ID from File 2 is looked up in the dictionary, and only the matching ones are written to your output file. Something like this could work:
with open('data.csv','r') as f:
    lines = f.readlines()
#Skip the CSV Header
lines = lines[1:]
table = {l.split()[0]:l.split()[1] for l in lines if len(l.strip()) != 0}
with open('id.csv','r') as f:
    lines = f.readlines()
#Skip the CSV Header
lines = lines[1:]
matchedIDs = [(l.strip(),table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them in any format you like in a file.
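As one possible way to write that list out, the csv module can produce a comma-separated file. In this sketch the list contents are made up in the same shape as matchedIDs, and `io.StringIO` stands in for the output file.

```python
import csv
import io

# Made-up list in the shape described above: (ID, value) tuples
matchedIDs = [
    ("HOT224_1_0025m_c100047_1", "16"),
    ("HOT224_1_0025m_c100061_1", "1"),
]

# io.StringIO stands in for open('result.csv', 'w', newline='')
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["ID", "Values"])  # header row
writer.writerows(matchedIDs)       # one line per matched ID
print(out.getvalue())
```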
I'm also new to Python programming, so the code below might not be the most efficient. I assumed the goal is to find the IDs in data.csv that are also in id.csv; some IDs in data.csv might not be in id.csv, and vice versa.
import pandas as pd
data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')
# Collect the IDs from both files as plain lists
d = []
for row in data['ID']:
    d.append(row)
f = []
for row in id2['IDs']:
    f.append(row)
# Keep the IDs that appear in both files
g = []
for i in d:
    if i in f:
        g.append(i)
data = pd.read_csv('data.csv', index_col='ID')
new_data = data.loc[g, :]
new_data.to_csv('new_data.csv')
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')