I've written a program that splits the dataset's columns into 4 columns, but some of my datasets only have 2 columns, so an error is thrown when that section is reached. I believe an if/else statement can help with this.
Here is the code for my program:
import pandas as pd
import os
# reading csv file from a local path
filepath = "C:/Users/username/folder1/folder2/folder3/b 2 col.csv"
file_encoding = 'cp1252'
data = pd.read_csv(filepath, header=None, names=list(range(0, 4)), encoding=file_encoding)
data.columns = ['ID', 'Name', 'S ID', 'SName']
# new data frame with split value columns
new = data["Name"].str.split(",", n=1, expand=True)
# making separate last name column from new data frame
data["Last Name"] = new[0]
# making separate first name column from new data frame
data["First Name"] = new[1]
# new data frame with split value columns (2)
new = data["SName"].str.split(",", n=1, expand=True)
# making separate last name column from new data frame
data["S Last Name"] = new[0]
# making separate first name column from new data frame
data["S First Name"] = new[1]
# Save the output under the source file's base name
filename = os.path.basename(filepath) + ".xlsx"
data.to_excel(filename, index=False)
data
This section onwards is responsible for splitting the second set of data:
# new data frame with split value columns (2)
new = data["SName"].str.split(",", n=1, expand=True)
The problem is that not all of my CSVs have four columns, so I'd like to implement an if/else here that checks whether the data is present: proceed if so, otherwise skip and move on to the next section:
# Save the output under the source file's base name
filename = os.path.basename(filepath) + ".xlsx"
data.to_excel(filename, index=False)
data
With that, I believe the program would work with all my datasets.
Link to an example of my datasets: https://drive.google.com/drive/folders/1nkLgo5tSFsxOTCa5EMWZlezDFi8AyaDq?usp=sharing
Thanks for helping
IIUC, assuming the .csv files are all in the same folder, here is a proposition with pandas.concat:
import pandas as pd
import os
filepath = "C:/Users/username/folder1/folder2"
file_encoding = "cp1252"
list_df = []
for filename in os.listdir(filepath):
    if filename.endswith(".csv"):
        df = pd.read_csv(os.path.join(filepath, filename),
                         header=None, encoding=file_encoding, on_bad_lines="skip")
        df = (pd.concat([df.iloc[:, i:i+5].pipe(lambda df_: df_.rename(columns={col: i for i, col in enumerate(df_.columns)}))
                         for i in range(0, df.shape[1], 5)], axis=0)
                .set_axis(["ID", "FullName", "Street No", "Street Add 1", "Street Add 2"], axis=1)
                .dropna(how="all"))
        df.insert(0, "filename", filename)  # comment this line out if you don't want the filename as a column
        list_df.append(df)
out = (pd.concat(list_df, ignore_index=True)
         .pipe(lambda df_: df_.join(df_["FullName"]
                                    .str.split(", ", expand=True)
                                    .rename(columns={0: "FirstName", 1: "LastName"}))))
Output:
print(out.head())
filename ID FullName Street No Street Add 1 Street Add 2 FirstName LastName
0 a 4 col upd.csv NaN Name NaN NaN NaN Name None
1 a 4 col upd.csv 1.0 Bruce, Wayne Street No 1 Street Add 1 Street Add 2 Bruce Wayne
2 a 4 col upd.csv 2.0 James, Gordon Street No 2 Street Add 2 Street Add 3 James Gordon
3 a 4 col upd.csv 3.0 Fish, Mooney Street No 3 Street Add 3 Street Add 4 Fish Mooney
4 a 4 col upd.csv 4.0 Selina, Kyle Street No 4 Street Add 4 Street Add 5 Selina Kyle
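If, as in the original script, the combined result should also land in an Excel file, one extra line would do it (a sketch; the output name here is made up):
out.to_excel("combined.xlsx", index=False)  # hypothetical file name; index=False as in the question's script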
You can split your 2 columns into 4 columns like this:
import numpy as np

# if some columns are missing, create them first
data['First Name'] = np.nan
data['Last Name'] = np.nan
data['S First Name'] = np.nan
data['S Last Name'] = np.nan
# if there are no missing values, remove the lines above
data[['First Name', 'Last Name']] = data.Name.astype(str).str.split(",", expand=True)
data[['S First Name', 'S Last Name']] = data.SName.astype(str).str.split(",", expand=True)
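To get the if/else the question asks about, a minimal guard could look like this (a sketch, assuming a 2-column file simply leaves the SName column entirely empty after the names=list(range(0, 4)) read):
# only split the second name column when it actually holds data
# (assumes at least one non-empty value contains a comma)
if "SName" in data.columns and data["SName"].notna().any():
    new = data["SName"].str.split(",", n=1, expand=True)
    data["S Last Name"] = new[0]
    data["S First Name"] = new[1]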
Related
I need to make a dataframe from two txt files.
The first txt file looks like this: Street_name space id.
The second txt file looks like this: City_name space id.
Example:
text file 1:
Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567
text file 2:
Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567
I need to make one dataframe out of this. Sometimes there is just one word for Street_name, and sometimes more. The same goes for City_name.
I get an error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 3, because I'm trying to put both words of the street name into the same column but don't know how to do it. I want one column for the street name (no matter if it consists of one or more words), one for the city name, and one for the id.
I want a df with 3 rows and 3 cols.
Thanks!
Edit: both text files are huge (each 50 million rows or more), so I need this code not to break and to be optimised for large files.
This is NOT valid CSV, so you may need to read it on your own.
You can use a normal open() and read(), then split on newlines to create a list of lines. Then you can use a for loop with line.rsplit(" ", 1) to split each line on its last space.
Minimal working example:
I use io to simulate a file in memory, so everyone can simply copy and test it, but you should use open().
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
import io
#with open('filename') as fh:
with io.StringIO(text) as fh:
    lines = fh.read().splitlines()
print(lines)
lines = [line.rsplit(" ", 1) for line in lines]
print(lines)
import pandas as pd
df = pd.DataFrame(lines, columns=['name', 'number'])
print(df)
Result:
['Roseberry st 1234', 'Brooklyn st 4321', 'Wolseley 1234567']
[['Roseberry st', '1234'], ['Brooklyn st', '4321'], ['Wolseley', '1234567']]
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
EDIT:
read_csv can use a regex to define the separator (i.e. sep=r"\s+" for many spaces) and it can even use lookahead/lookbehind ((?=...)/(?<=...)) to check if there is a digit after the space without catching it as part of the separator.
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
import io
import pandas as pd
#df = pd.read_csv('filename', names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
df = pd.read_csv(io.StringIO(text), names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
print(df)
Result:
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
And later you can connect both dataframes using .join() or .merge() with the parameter on= (or something similar), like in an SQL query.
text1 = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
text2 = '''Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567'''
import io
import pandas as pd
df1 = pd.read_csv(io.StringIO(text1), names=['street name', 'id'], sep=r'\s(?=\d)', engine='python')
df2 = pd.read_csv(io.StringIO(text2), names=['city name', 'id'], sep=r'\s(?=\d)', engine='python')
print(df1)
print(df2)
df = df1.merge(df2, on='id')
print(df)
Result:
street name id
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
city name id
0 Winnipeg 4321
1 Winnipeg 1234
2 Ste Anne 1234567
street name id city name
0 Roseberry st 1234 Winnipeg
1 Brooklyn st 4321 Winnipeg
2 Wolseley 1234567 Ste Anne
Pandas doc: Merge, join, concatenate and compare
There's nothing that I'm aware of in pandas that does this automatically.
Below, I built a script that merges those addresses (addy + st) into a single column, then merges the two data frames into one based on the "id".
I assume your actual text files are significantly larger, so assuming they follow the pattern set in the two examples, this script should work fine.
Basically, this code turns each line of text in the file into a list, then combines lists of length 3 into length 2 by combining the first two list items.
After that, it turns the "list of lists" into a dataframe and merges those dataframes on column "id".
A couple of caveats:
Make sure you set the correct text file paths
Make sure the first line of each text file contains 2 single-string column headers (i.e. "address id" or "city id")
Make sure each text file id column header is named "id"
import pandas as pd

# set both text file paths (you may need the full path, i.e. C:\Users\Name\bla\bla\bla\text1.txt)
text_path_1 = r'text1.txt'
text_path_2 = r'text2.txt'

# reads the first text file
with open(text_path_1) as f1:
    text_file_1 = f1.readlines()

# reads the second text file
with open(text_path_2) as f2:
    text_file_2 = f2.readlines()
# function that massages the data into two columns (to put "st" into the same column as the address name)
def data_massager(text_file_lines):
    data_list = []
    for item in text_file_lines:
        stripped_item = item.strip('\n')
        split_stripped_item = stripped_item.split(' ')
        if len(split_stripped_item) == 3:
            split_stripped_item[0:2] = [' '.join(split_stripped_item[0:2])]
        data_list.append(split_stripped_item)
    return data_list
# runs function on both text files
data_list_1 = data_massager(text_file_1)
data_list_2 = data_massager(text_file_2)
# creates dataframes on both text files
df1 = pd.DataFrame(data_list_1[1:], columns = data_list_1[0])
df2 = pd.DataFrame(data_list_2[1:], columns = data_list_2[0])
# merges data based on id (make sure both text files' id is named "id")
merged_df = df1.merge(df2, how='left', on='id')
# prints dataframe (assuming you're using something like jupyter-lab)
merged_df
pandas has strong support for strings. You can make the lines of each file into a Series and then use a regular expression to separate the fields into separate columns. I assume that "id" is the common value that links the two datasets, so it can become the dataframe index and the columns can just be added together.
import pandas as pd
street_series = pd.Series([line.strip() for line in open("text1.txt")])
street_df = street_series.str.extract(r"(.*?) (\d+)$")
del street_series
street_df.rename({0:"street", 1:"id"}, axis=1, inplace=True)
street_df.set_index("id", inplace=True)
print(street_df)
city_series = pd.Series([line.strip() for line in open("text2.txt")])
city_df = city_series.str.extract(r"(.*?) (\d+)$")
del city_series
city_df.rename({0:"city", 1:"id"}, axis=1, inplace=True)
city_df.set_index("id", inplace=True)
print(city_df)
street_df["city"] = city_df["city"]
print(street_df)
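Since the question mentions files of 50M+ rows, a hedged note: materializing all lines in a Python list first briefly holds two copies of the data in memory. A sketch that streams the file in chunks instead, reusing the regex separator from the earlier answer:
import pandas as pd

# read the street file in one-million-row chunks;
# engine="python" is required for the regex separator
chunks = pd.read_csv("text1.txt", names=["street", "id"],
                     sep=r"\s(?=\d)", engine="python", chunksize=1_000_000)
street_df = pd.concat(chunk.set_index("id") for chunk in chunks)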
Sample data from the text file:
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email = michael.toole#123.ie
department=Marketing
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo#sms.ie
department=Data Science
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee#email.com
[User]
Wondering if someone could help me; you can see my sample dataset above. What I would like to do (please tell me if there is a more efficient way) is to loop through the first column, and wherever one of the unique ids occurs (e.g. first_name, last_name, role, etc.) append the value in the corresponding row to that list, doing this with each unique ID so that I'm left with the below.
I have read about multi-indexing and I'm not sure if that might be a better solution, but I couldn't get it to work (I'm quite new to Python).
[image of the desired output table in the original post]
# Define a list of selected persons
selectedList = textFile
# Define a list of search keys
searchList = ['uid']
# Define an empty list
foundList = []
# Iterate over each element of the selected list
for index, sList in enumerate(textFile):
    # Match the element against the elements of searchList
    if sList in searchList:
        # Store the value in foundList if a match is found
        foundList.append(selectedList[index])
You have a text file where each record starts with a [User] line and data lines have a key=value format. I know of no module able to handle that automatically, but it is easy to parse by hand. The code could be:
import pandas as pd

with open('file.txt') as fd:
    data = []  # a list of records
    for line in fd:
        line = line.strip()  # strip end of line
        if line == '[User]':  # new record
            row = {}  # row will be a key: value dict
            data.append(row)
        else:
            k, v = line.split('=', 1)  # split on the first = character
            row[k] = v

df = pd.DataFrame(data)  # list of key: value dicts => dataframe
With the sample data shown, we get:
employeeNo last_name first_name language email department role email Location
0 123 Toole Michael english michael.toole#123.ie Marketing Marketing Lead NaN NaN
1 456 Ronaldo Juan Spanish NaN Data Science Team Lead juan.ronaldo#sms.ie Spain
2 998 Lee Damian english NaN NaN NaN damian.lee#email.com NaN
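Note the duplicated email column in this output: the first record's line reads email = michael.toole#123.ie with spaces around the =, so 'email ' (with a trailing space) becomes a separate key from 'email'. Stripping whitespace in the loop avoids that:
k, v = line.split('=', 1)
row[k.strip()] = v.strip()  # collapses "email " and "email" into a single column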
I'm sure there is a more optimal way to do this, but my approach is to get a unique list of row names, extract them in a loop, and combine them into a new dataframe. Finally, update it with the desired column names.
import pandas as pd
import io
data = '''
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email=michael.toole#123.ie
department=Marketing
role="Marketing Lead"
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo#sms.ie
department="Data Science"
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee#email.com
[User]
'''
df = pd.read_csv(io.StringIO(data), sep='=', comment='[', header=None)
new_cols = df[0].unique()
new_df = pd.DataFrame()
for col in new_cols:
    tmp = df[df[0] == col]
    tmp.reset_index(inplace=True)
    new_df = pd.concat([new_df, tmp[1]], axis=1)
new_df.columns = new_cols
new_df['User'] = None
new_df = new_df[['User','employeeNo','last_name','first_name','language','email','department','role','Location']]
new_df
User employeeNo last_name first_name language email department role Location
0 None 123 Toole Michael english michael.toole#123.ie Marketing Marketing Lead Spain
1 None 456 Ronaldo Juan Spanish juan.ronaldo#sms.ie Data Science Team Lead NaN
2 None 998 Lee Damian english damian.lee#email.com NaN NaN NaN
Rewrite based on testing of the previous version's offset values.
import pandas as pd
# Revised from previous answer - ensures key value pairs are contained to the same
# record - previous version assumed the first record had all the expected keys -
# inadvertently assigned (Location) value of second record to the first record
# which did not have a Location key
# This version should perform better - only dealing with one single df
# - and using pandas own pivot() function
textFile = 'file.txt'
filter = '[User]'
# Decoration, enabling a check and balance: how many users are we processing?
with open(textFile, 'r') as textFileOpened:
    initialRead = textFileOpened.read()
userCount = initialRead.count(filter)  # sample has 4 [User] entries, but only three actual unique records
print('User Count {}'.format(userCount))
# Create lists so we are able to manipulate and interrogate them
allData = []
oneRow = []
userSeq = 0
# Iterate through the file, assigning a record key and userSeq key to each pair
with open(textFile, 'r') as fp:
    for fileLineSeq, line in enumerate(fp):
        if filter in str(line):
            userSeq = userSeq + 1  # ensures each key/value pair is grouped
        oneRow = [fileLineSeq, userSeq, line]
        allData.append(oneRow)
df = pd.DataFrame(allData)
df.columns = ['FileRow', 'UserSeq', 'KeyValue']  # rename columns
userSeparators = df[df['KeyValue'] == str(filter + '\n')].index  # locate [User] records
df.drop(userSeparators, inplace=True)  # remove [User] records
df = df.replace(' = ', '=', regex=True)  # input data is dirty; clean it up
df = df.replace('\n', '', regex=True)  # remove the newlines appended during the list generation
# print(df) # Test as necessary here
# split KeyValue column into two
df[['Key', 'Value']] = df.KeyValue.str.split('=', expand=True)
# very powerful function - convert to table
df = df.pivot(index='UserSeq', columns='Key', values='Value')
print(df)
Results
User Count 4
Key Location department email employeeNo first_name language last_name role
UserSeq
1 NaN Marketing michael.toole#123.ie 123 Michael english Toole Marketing Lead
2 Spain Data Science juan.ronaldo#sms.ie 456 Juan Spanish Ronaldo Team Lead
3 NaN NaN damian.lee#email.com 998 Damian english Lee NaN
I'm trying to delete rows with empty cells from the 'calories.xlsx' spreadsheet and send all data, except the empty rows, to the 'destination.xlsx' spreadsheet. The code below is as far as I got, but it still does not delete rows that have an empty value in the calories column.
This is the data set: [image of the calories spreadsheet in the original post]
How can I develop my code to solve this problem?
import pandas as pd
FileName = 'calories.xlsx'
SheetName = pd.read_excel(FileName, sheet_name = 'Sheet1')
df = SheetName
print(df)
ListCalories = ['Calories']
print(ListCalories)
for Cell in ListCalories:
    if Cell == 'NaN':
        df.drop[Cell]
print(df)
df.to_excel('destination.xlsx')
Create dummy data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'calories': [2306, 3256, 1235, np.nan, 3654, 3256],
    'Person': ['person1', 'person2', 'person3', 'person4', 'person5', 'person6'],
})
Print the data frame:
calories Person
0 2306.0 person1
1 3256.0 person2
2 1235.0 person3
3 NaN person4
4 3654.0 person5
5 3256.0 person6
Remove rows where the calories value is missing:
new_df = df.dropna(how='any', subset=['calories'])
Result
calories Person
0 2306.0 person1
1 3256.0 person2
2 1235.0 person3
4 3654.0 person5
5 3256.0 person6
Save as Excel:
new_df.to_excel('destination.xlsx', index=False)
Your ListCalories contains only one element, which is the string 'Calories'; I'll assume this was a typo.
What you are probably trying to do is:
import pandas as pd
FileName = 'calories.xlsx'
df = pd.read_excel(FileName, sheet_name = 'Sheet1')
print(df)
# you don't need this, but I kept it for you
ListCalories = df['Calories']
print(ListCalories)
clean_df = df[df['Calories'].notna()]  # selects only the rows that don't have an NA value in the Calories column
print(clean_df)
clean_df.to_excel('destination.xlsx')
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html
I used Python to read a file which contains babies' names, genders and birth years. Now I want to find the names that are used by both boys and girls. I used value_counts() to get the number of appearances of each name, but now I don't know how to extract those names from all the names.
Here is my code:
def names_both(year):
    names = []
    path = 'babynames/yob%d.txt' % year
    columns = ['name', 'sex', 'birth']
    frame = pd.read_csv(path, names=columns)
    frame = frame['name'].value_counts()
    print(frame)
    """if len(names) != 0:
        print(names)
    else:
        print('None')"""
The frame now looks like this:
Lou 2
Willie 2
Erie 2
Cora 2
..
Perry 1
Coy 1
Adolphus 1
Ula 1
Emily 1
Name: name, Length: 1889, dtype: int64
Here is the csv:
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288
Annie,F,1258
Clara,F,1226
Ella,F,1156
Florence,F,1063
...
Thanks for helping!
Here is how to find the names given both to girls and boys, and count their births:
common_girl_and_boys_names = (
    # work name by name
    frame.groupby('name')
    # count the sexes recorded for each name and keep the names given to both sexes; this boolean lands in a column called 0
    .apply(lambda x: len(x['sex'].unique()) == 2)
    # the names are now in the index; reset it in order to get the names back as a column
    .reset_index()
    # keep only names whose column 0 holds True
    .loc[lambda x: x[0], 'name']
)
final_df = (
    # keep only the names common to boys and girls (the series built before)
    frame.loc[frame['name'].isin(common_girl_and_boys_names), :]
    # sex is now useless
    .drop(['sex'], axis='columns')
    # work name by name and sum the number of births
    .groupby('name')
    .sum()
)
You can put those lines after the read_csv function. I hope it is what you want.
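For reference, a shorter equivalent using nunique, a sketch that should give the same result when placed after the same read_csv:
# names recorded with both sexes
counts = frame.groupby('name')['sex'].nunique()
common_names = counts[counts == 2].index

# total births per common name, sexes combined
final_df = frame[frame['name'].isin(common_names)].groupby('name')['birth'].sum()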
I am reading from a csv file using Python and I pulled the data from one column. Every 15 lines is a set of data for one category, but only the first 5 lines from each set are relevant. How can I read the first 5 lines of every 15 lines from a total of 205 lines? It reads every other 15 up to a point and then it begins to misalign; the blue rectangles show where it starts to stray. Here is an image of my data format from the column: https://i.stack.imgur.com/yMqw8.png
inp = pd.read_csv(input_dir)
win = inp['File '][1:]
ny4 = inp['Unnamed: 24']
df = pd.DataFrame({"Ny/4": ny4})
ny4_len = len(ny4)
#
for i in win:
    itr.append(i.split('_'))
for e in itr:
    plh1.append(e[4:e.index("SFRP")])
plh1 = plh1[1::13]
win_df = pd.DataFrame({'Window': plh1})
for u in win_df['Window']:
    plh2.append(k.join(u))
chicken = len(df)/12
#kow = list(islice(ny4, 1, 17))
#
maxm = pd.concat(list(map(lambda x: x[1:6], np.array_split(ny4, 17))), ignore_index=True)
plh2_df = pd.DataFrame({'Window Name': plh2})
ny4_data = pd.DataFrame(np.reshape(maxm.values, (17, 5)), columns=['Center', 'UL', 'UR', 'LL', 'LR'])
conc = pd.concat([plh2_df, ny4_data], axis=1, sort=True)
Use pd.concat with np.array_split:
print(pd.concat(list(map(lambda x: x[:5], np.array_split(df, len(df) // 15))), ignore_index=True))
It should work now.
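For what it's worth, a tiny self-contained check of the pattern on dummy data. Note that np.array_split truncates a fractional section count, so the one-liner above assumes the row count is an exact multiple of 15; a positional mask handles leftovers such as the 205 lines in the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": range(30)})  # two complete 15-row sets

out = pd.concat(list(map(lambda x: x[:5], np.array_split(df, len(df) // 15))),
                ignore_index=True)
print(out["value"].tolist())  # [0, 1, 2, 3, 4, 15, 16, 17, 18, 19]

# alternative that tolerates a trailing partial set, e.g. 205 rows:
out2 = df[df.index % 15 < 5]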