I have the following code (below) that grabs two CSV files and merges the data into one consolidated CSV file.
I now need to grab specific information from one of the columns and add that information to another column.
What I have now is one output.csv file with the following sample data:
ID,Name,Flavor,RAM,Disk,VCPUs
45fc754d-6a9b-4bde-b7ad-be91ae60f582,customer1-test1-dns,m1.medium,4096,40,2
83dbc739-e436-4c9f-a561-c5b40a3a6da5,customer2-test2,m1.tiny,128,1,1
ef68fcf3-f624-416d-a59b-bb8f1aa2a769,customer3-test3-dns-api,m1.medium,4096,40,2
What I need to do is open this CSV file and split the data in the Name column across two columns as follows:
ID,Name,Flavor,RAM,Disk,VCPUs,Customer,Misc
45fc754d-6a9b-4bde-b7ad-be91ae60f582,customer1-test1-dns,m1.medium,4096,40,2,customer1,test1-dns
83dbc739-e436-4c9f-a561-c5b40a3a6da5,customer2-test2,m1.tiny,128,1,1,customer2,test2
ef68fcf3-f624-416d-a59b-bb8f1aa2a769,customer3-test3-dns-api,m1.medium,4096,40,2,customer3,test3-dns-api
Note how the Misc column can contain multiple values separated by one or more -.
How can I accomplish this in Python? Below is the code I have now:
import csv
import os
import pandas as pd

by_name = {}
with open('flavor.csv') as b:
    for row in csv.DictReader(b):
        name = row.pop('Name')
        by_name[name] = row

with open('output.csv', 'w') as c:
    w = csv.DictWriter(c, ['ID', 'Name', 'Flavor', 'RAM', 'Disk', 'VCPUs'])
    w.writeheader()
    with open('instance.csv') as a:
        for row in csv.DictReader(a):
            try:
                match = by_name[row['Flavor']]
            except KeyError:
                # skip rows whose Flavor has no match in flavor.csv
                continue
            row.update(match)
            w.writerow(row)
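If you would rather stay with the csv module you are already using, the extra columns can be added with an ordinary string split. A minimal sketch, using the sample data from the question (the file is written first here only to make the example self-contained):

```python
import csv

# Recreate a small sample of the merged output.csv from the question
with open('output.csv', 'w', newline='') as f:
    f.write('ID,Name,Flavor,RAM,Disk,VCPUs\n')
    f.write('45fc754d-6a9b-4bde-b7ad-be91ae60f582,customer1-test1-dns,m1.medium,4096,40,2\n')
    f.write('83dbc739-e436-4c9f-a561-c5b40a3a6da5,customer2-test2,m1.tiny,128,1,1\n')

with open('output.csv', newline='') as src:
    rows = list(csv.DictReader(src))

for row in rows:
    # partition splits on the first '-' only, so Misc keeps any later dashes
    row['Customer'], _, row['Misc'] = row['Name'].partition('-')

with open('output_split.csv', 'w', newline='') as dst:
    w = csv.DictWriter(dst, ['ID', 'Name', 'Flavor', 'RAM', 'Disk', 'VCPUs', 'Customer', 'Misc'])
    w.writeheader()
    w.writerows(rows)
```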
Try this:

import pandas as pd

df = pd.read_csv('output.csv')  # the merged file produced by your code above
df[['Customer', 'Misc']] = df.Name.str.split('-', n=1, expand=True)
df
Output:
ID Name Flavor RAM Disk VCPUs Customer Misc
0 45fc754d-6a9b-4bde-b7ad-be91ae60f582 customer1-test1-dns m1.medium 4096 40 2 customer1 test1-dns
1 83dbc739-e436-4c9f-a561-c5b40a3a6da5 customer2-test2 m1.tiny 128 1 1 customer2 test2
2 ef68fcf3-f624-416d-a59b-bb8f1aa2a769 customer3-test3-dns-api m1.medium 4096 40 2 customer3 test3-dns-api
I would recommend switching over to pandas. Here's the official Getting Started documentation.
Let's first read in the csv.
import pandas as pd
df = pd.read_csv('input.csv')
print(df.head(1))
You should get something similar to:
ID Name Flavor RAM Disk VCPUs
0 45fc754d-6a9b-4bde-b7ad-be91ae60f582 customer1-test1-dns m1.medium 4096 40 2
After that, use string manipulation in the Pandas Series:
df[['Customer','Misc']] = df.Name.str.split('-', n=1, expand=True)
Finally, you can save the csv.
df.to_csv('output.csv')
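One caveat: by default to_csv also writes the row index as an extra unnamed column, which comes back as Unnamed: 0 on the next read. A small sketch (the data here is made up for illustration):

```python
import pandas as pd

# Minimal frame mimicking the question's Name column
df = pd.DataFrame({'Name': ['customer1-test1-dns', 'customer2-test2'],
                   'RAM': [4096, 128]})
df[['Customer', 'Misc']] = df['Name'].str.split('-', n=1, expand=True)

# index=False keeps the unnamed index column out of the saved file
df.to_csv('output.csv', index=False)
print(pd.read_csv('output.csv').columns.tolist())
```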
This code would be much more elegant and simpler if you used pandas.
import pandas as pd
df = pd.read_csv('flavor.csv')
df[['Customer','Misc']] = df['Name'].str.split(pat='-',n=1,expand=True)
df.to_csv('output.csv',index=False)
Documentation ref
Here is how I do it. The trick is in the split() function:

import pandas as pd

file = pd.read_csv(r"C:\...\yourfile.csv", sep=",")
file['Customer'] = None
file['Misc'] = None
for x in range(len(file)):
    # assumes every Name contains at least one '-'
    temp = file.Name[x].split('-', maxsplit=1)
    # use .loc with row and column labels; chained assignment such as
    # file['Customer'].iloc[x] = ... triggers SettingWithCopyWarning
    file.loc[x, 'Customer'] = temp[0]
    file.loc[x, 'Misc'] = temp[1]
file.to_csv(r"C:\...\yourfile_result.csv", index=False)
Hello everyone, I am new and learning Python. I have a column in a CSV file with this example of a value:
I want to divide the programme column on that semicolon into two columns, for example:
program 1: H2020-EU.3.1.
program 2: H2020-EU.3.1.7.
This is what I wrote initially:
import csv
import os

with open('IMI.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    with open('new_IMI.csv', 'w') as new_file:
        csv_writer = csv.writer(new_file, delimiter='\t')
        #for line in csv_reader:
        #    csv_writer.writerow(line)
Please note that after I split the column I need to write the file again as a CSV and save it to my computer.
Please guide me.
Iterating through each row of a DataFrame with .loc is somewhat inefficient. It is better to split the entire column at once, using expand=True to assign the result to the new columns. As stated, pandas makes this easy:
Code:
import pandas as pd
df = pd.read_csv('IMI.csv')
df[['programme1','programme2']] = df['programme'].str.split(';', expand=True)
df.drop(['programme'], axis=1, inplace=True)
df.to_csv('IMI.csv', index=False)
Example of output:
Before:
print(df)
id acronym status programme topics
0 945358 BIGPICTURE SIGNED H2020-EU.3.1.;H2020-EU3.1.7 IMI2-2019-18-01
1 821362 EBiSC2 SIGNED H2020-EU.3.1.;H2020-EU3.1.7 IMI2-2017-13-06
2 116026 HARMONY SIGNED H202-EU.3.1. IMI2-2015-06-04
After:
print(df)
id acronym status topics programme1 programme2
0 945358 BIGPICTURE SIGNED IMI2-2019-18-01 H2020-EU.3.1. H2020-EU3.1.7
1 821362 EBiSC2 SIGNED IMI2-2017-13-06 H2020-EU.3.1. H2020-EU3.1.7
2 116026 HARMONY SIGNED IMI2-2015-06-04 H2020-EU.3.1. None
You can use pandas library instead of csv.
import pandas as pd

df = pd.read_csv('IMI.csv')
p1 = {}
p2 = {}
for i in range(len(df)):
    if ';' in df['programme'].loc[i]:
        p1[df['id'].loc[i]] = df['programme'].loc[i].split(';')[0]
        p2[df['id'].loc[i]] = df['programme'].loc[i].split(';')[1]
df['programme1'] = df['id'].map(p1)
df['programme2'] = df['id'].map(p2)
and if you want to delete the programme column (note that drop returns a new DataFrame, so reassign the result):
df = df.drop('programme', axis=1)
To save the new csv file (to_csv has no inplace parameter; pass index=False to skip the row index):
df.to_csv('new_file.csv', index=False)
I am working with a dataset, Adult, that I have changed and would like to save as a csv. However, after saving it as a csv and re-loading the data to work with again, the data is not converted properly. The headers are not preserved and some columns are now combined. I have looked through the page and online, but what I have tried is not working. I load the data in with the following code:
import numpy as np  # import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *

# Reading in data from a freely and easily available source on the internet
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)  # skipinitialspace=True removes extra spaces in columns

# Assigning reasonable column names to the dataframe
Adult.columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
                 "relationship", "race", "sex", "capitalgain", "capitalloss", "hoursperweek", "nativecountry",
                 "less50kmoreeq50kn"]
After inserting missing values and changing the data frame as desired I have tried:
df = Adult
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
and a few other variations. How can I save the file to a CSV and preserve the correct format for the next time I read the file in?
When re-loading the data I use the code:
import pandas as pd
df = pd.read_csv('file_name.csv')
When running df.head (note the missing parentheses, which is why the output shows a bound method rather than just the data) the output is:
<bound method NDFrame.head of Unnamed: 0 Unnamed: 0.1 age ... Black Asian-Pac-Islander Other
0 0 0 39 ... 0 0 0
1 1 1 50 ... 0 0 0
2 2 2 38 ... 0 0 0
3 3 3 53 ... 1 0 0
and print(df.loc[:,"age"].value_counts()) the output is:
36 898
31 888
34 886
23 877
35 876
which should not have 2 columns
If you pickle it like so:
Adult.to_pickle('adult.pickle')
You will, subsequently, be able to read it back in using read_pickle as follows:
original_adult = pd.read_pickle('adult.pickle')
Hope that helps.
If you want to preserve the output column order you can specify the columns directly while saving the DataFrame:
import pandas as pd
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df = pd.read_csv(url2, header=None, skipinitialspace=True)
my_columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
df.columns = my_columns
# do the computation ...
df[my_columns].to_csv('file_name.csv')
You can add parameter index=False to the to_csv('file_name.csv', index=False) function if you are not interested in saving the DataFrame row index. Otherwise, while reading the csv file again you'd need to specify the index_col parameter.
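To illustrate the round trip described above (file names and data here are just examples):

```python
import pandas as pd

df = pd.DataFrame({'age': [39, 50],
                   'workclass': ['State-gov', 'Self-emp-not-inc']})

# Saved with the index: pass index_col=0 when reading it back,
# otherwise the index reappears as an 'Unnamed: 0' column
df.to_csv('with_index.csv')
back = pd.read_csv('with_index.csv', index_col=0)

# Saved without the index: a plain read_csv round-trips cleanly
df.to_csv('no_index.csv', index=False)
back2 = pd.read_csv('no_index.csv')
```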
According to the documentation, value_counts() returns a Series object: you see two columns because the first one is the index, the age values (36, 31, ...), and the second is the count (898, 888, ...).
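A tiny sketch of that behaviour; the left-hand "column" printed by value_counts() is the Series index, not a second data column:

```python
import pandas as pd

ages = pd.Series([36, 31, 36, 36, 31])
counts = ages.value_counts()
print(counts.index.tolist())  # the distinct values, most frequent first
print(counts.tolist())        # their frequencies
```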
I replicated your code and it works for me. The order of the columns is preserved.
Let me show what I tried. First, this batch of code:
import numpy as np  # import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *

# Reading in data from a freely and easily available source on the internet
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)  # skipinitialspace=True removes extra spaces in columns

# Assigning reasonable column names to the dataframe
Adult.columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
                 "relationship", "race", "sex", "capitalgain", "capitalloss", "hoursperweek", "nativecountry",
                 "less50kmoreeq50kn"]
This worked perfectly. Then
df = Adult
This also worked.
Then I saved this data frame to a csv file. Make sure you are providing the absolute path to the file, even if it is being saved in the same folder as this script.
df.to_csv('full_path_to_the_file.csv',header = True)
# so something like
#df.to_csv('Users/user_name/Desktop/folder/NameFile.csv', header=True)
Load this csv file into a new_df. It will generate a new column for keeping track of index. It is unnecessary and you can drop it like following:
new_df = pd.read_csv('Users/user_name/Desktop/folder/NameFile.csv', index_col = None)
new_df= new_df.drop('Unnamed: 0', axis =1)
When I compare the columns of new_df with the original df, using this line of code
new_df.columns == df.columns
I get
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True])
You might not have been providing the absolute path to the file, or you may have been saving the file twice, as here. You only need to save it once.
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
When you save the dataframe, in general the first column is the index, and you should load the index when reading the dataframe. Also, whenever you assign a dataframe to a variable, make sure to copy the dataframe:
df = Adult.copy()
df.to_csv('file_name.csv',header = True)
And to read:
df = pd.read_csv('file_name.csv', index_col=0)
The first column from print(df.loc[:,"age"].value_counts()) is the index column, which is shown whenever you query the dataframe. To save the values to a list, use the to_list method:
print(df.loc[:,"age"].value_counts().to_list())
I have the below file (file1.xlsx) as input. In total I have 32 columns in this file and almost 2500 rows. Just for example, I am mentioning 5 columns in the screen print.
I want to edit the same file with Python and get the output as (file1.xlsx).
It should be noted I am adding one column named short, and the data is a kind of substring up to the first decimal of the data present in the Name (A) column of the same Excel file.
Request you to please help.
Regards
Kawaljeet
Here is what you need...

import pandas as pd

file_name = "file1.xlsx"
df = pd.read_excel(file_name)  # read the Excel file as a DataFrame
# .str[0] takes the part before the first '.' for every row
# (plain [0] after split would pick the first row, not the first part)
df['short'] = df['Name'].str.split(".").str[0]
df.to_excel("file1.xlsx", index=False)
Hello guys, I solved the problem with the below code:

import pandas as pd
import os

def add_column():
    file_name = "cmdb_inuse.xlsx"
    os.chmod(file_name, 0o777)
    df = pd.read_excel(file_name)  # read the Excel file as a DataFrame
    df['short'] = [x.split(".")[0] for x in df['Name']]
    df.to_excel("cmdb_inuse.xlsx", index=False)
I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
Not sure if I am reading in the excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv with whitespace delimited:
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
it looks like this:
>>> data_file
Values
ID
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and you use its index with loc to get the matching entries from your first dataframe. Drop the missing values with dropna():
res = data_file.loc[ID_file.index].dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
You can do it using a simple dictionary in Python. You can build a dictionary from file 1 and read the IDs from file 2. The IDs from file 2 can be checked against the dictionary and only the matching ones written to your output file. Something like this could work:

with open('data.csv', 'r') as f:
    lines = f.readlines()

# Skip the CSV header
lines = lines[1:]
table = {l.split()[0]: l.split()[1] for l in lines if len(l.strip()) != 0}

with open('id.csv', 'r') as f:
    lines = f.readlines()

# Skip the CSV header
lines = lines[1:]
matchedIDs = [(l.strip(), table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them in any format you like in a file.
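For instance, writing them out as a CSV could look like this (the file name and sample tuples are just examples):

```python
import csv

# hypothetical result of the matching step above
matchedIDs = [('HOT224_1_0025m_c100047_1', '16'),
              ('HOT224_1_0025m_c100061_1', '1')]

with open('matched.csv', 'w', newline='') as out:
    w = csv.writer(out)
    w.writerow(['ID', 'Values'])  # header row
    w.writerows(matchedIDs)
```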
I'm also new to Python programming, so the code I used below might not be the most efficient. The situation I assumed is that we need to find IDs from data.csv that are also in id.csv; there might be some IDs in data.csv not in id.csv and vice versa.
import pandas as pd

data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')

# collect the IDs from each file ('idd' in the original was a typo for 'id2')
d = []
for row in data['ID']:
    d.append(row)

f = []
for row in id2['IDs']:
    f.append(row)

# keep only the IDs present in both files
g = []
for i in d:
    if i in f:
        g.append(i)

data = pd.read_csv('data.csv', index_col='ID')
new_data = data.loc[g, :]
new_data.to_csv('new_data.csv')
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
I have a CSV file which basically looks like the following (I shortened it to a minimal example showing the structure):
ID1#First_Name
TIME_BIN,COUNT,AVG
09:00-12:00,100,50
15:00-18:00,24,14
21:00-23:00,69,47
ID2#Second_Name
TIME_BIN,COUNT,AVG
09:00-12:00,36,5
15:00-18:00,74,68
21:00-23:00,22,76
ID3#Third_Name
TIME_BIN,COUNT,AVG
09:00-12:00,15,10
15:00-18:00,77,36
21:00-23:00,55,18
As one can see, the data is separated into multiple blocks. Each block has a headline (e.g. ID1#First_Name) which contains two pieces of information (IDx and x_Name), separated by #.
Each headline is followed by the column headers (TIME_BIN, COUNT, AVG) which stay the same for all blocks.
Then follow some lines of data which belong to the column headers (e.g. TIME_BIN=09:00-12:00, COUNT=100, AVG=50).
I would like to parse this file into a Pandas dataframe which would look like the following:
ID Name TIME_BIN COUNT AVG
ID1 First_Name 09:00-12:00 100 50
ID1 First_Name 15:00-18:00 24 14
ID1 First_Name 21:00-23:00 69 47
ID2 Second_Name 09:00-12:00 36 5
ID2 Second_Name 15:00-18:00 74 68
ID2 Second_Name 21:00-23:00 22 76
ID3 Third_Name 09:00-12:00 15 10
ID3 Third_Name 15:00-18:00 77 36
ID3 Third_Name 21:00-23:00 55 18
This means that the headline may not be skipped but has to be split by the # and then linked to the data from the block it belongs to. Besides, the column headers are only needed once since they do not change later on.
Somehow I managed to achieve my goal with the following code. However, the approach looks kind of overcomplicated and not robust to me and I am sure that there are better ways to do this. Any suggestions are welcome!
import pandas as pd
from io import StringIO  # Python 3 (for Python 2 use: from StringIO import StringIO)

pathToFile = 'mydata.txt'

# read the textfile into a StringIO object and skip the repeating column header rows
s = StringIO()
with open(pathToFile) as file:
    for line in file:
        if not line.startswith('TIME_BIN'):
            s.write(line)

# reset buffer to the beginning of the StringIO object
s.seek(0)

# create new dataframe with desired column names
df = pd.read_csv(s, names=['TIME_BIN', 'COUNT', 'AVG'])

# split the headline string which is currently found in the TIME_BIN column and insert both parts as new dataframe columns.
# the headline is identified by its start which is 'ID'
df['ID'] = df[df.TIME_BIN.str.startswith('ID')].TIME_BIN.str.split('#').str.get(0)
df['Name'] = df[df.TIME_BIN.str.startswith('ID')].TIME_BIN.str.split('#').str.get(1)

# fill the NaN values in the ID and Name columns by propagating the last valid observation
df['ID'] = df['ID'].ffill()
df['Name'] = df['Name'].ffill()

# remove all rows where TIME_BIN starts with 'ID'
df['TIME_BIN'] = df['TIME_BIN'].drop(df[df.TIME_BIN.str.startswith('ID')].index)
df = df.dropna(subset=['TIME_BIN'])

# reorder columns to bring ID and Name to the front
cols = list(df)
cols.insert(0, cols.pop(cols.index('Name')))
cols.insert(0, cols.pop(cols.index('ID')))
df = df[cols]  # .ix is deprecated; plain column indexing works here
import pandas as pd
from io import StringIO  # Python 3 (the original answer used Python 2's StringIO)
import sys

pathToFile = 'mydata.txt'

s = StringIO()
cur_ID = None
with open(pathToFile) as f:
    for ln in f:
        if not ln.strip():
            continue
        if ln.startswith('ID'):
            # turn 'ID1#First_Name\n' into 'ID1,First_Name,' to prefix the data rows
            cur_ID = ln.replace('\n', ',', 1).replace('#', ',', 1)
            continue
        if ln.startswith('TIME'):
            continue
        if cur_ID is None:
            print('No ID found')
            sys.exit(1)
        s.write(cur_ID + ln)
s.seek(0)

# create new dataframe with desired column names
df = pd.read_csv(s, names=['ID', 'Name', 'TIME_BIN', 'COUNT', 'AVG'])