I have a CSV file with a lot of rows and a different number of columns in each row.
How can I group the data by column count and show it in different dataframes?
The CSV file contains the following data:
1 OLEG US FRANCE BIG
1 OLEG FR 18
1 NATA 18
Because each row has a different number of columns, I have to group the rows by column count and show 3 frames so that I can set a header for each:
FR1:
ID NAME STATE COUNTRY HOBBY
1 OLEG US FRANCE BIG
FR2:
ID NAME COUNTRY AGE
1 OLEG FR 18
FR3:
ID NAME AGE
1 NATA 18
In other words, I need to group the rows by count of columns and show them in different dataframes.
Since pandas doesn't allow rows of different lengths in a single dataframe, just don't use it to import your data. Your goal is to create three separate dataframes, so first import the data as lists and then deal with the different lengths.
One way to solve this is to read the data with csv.reader and create the dataframes with list comprehensions, each filtered by a condition on the length of the lists.
import csv
import pandas as pd

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

df1 = pd.DataFrame([item for item in data if len(item) == 3], columns='ID NAME AGE'.split())
df2 = pd.DataFrame([item for item in data if len(item) == 4], columns='ID NAME COUNTRY AGE'.split())
df3 = pd.DataFrame([item for item in data if len(item) == 5], columns='ID NAME STATE COUNTRY HOBBY'.split())

print(df1, df2, df3, sep='\n\n')
ID NAME AGE
0 1 NATA 18
ID NAME COUNTRY AGE
0 1 OLEG FR 18
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
If you would otherwise have to hardcode too many lines for the same step (e.g. too many dataframes), you should consider using a loop to create them and store each dataframe as a key/value pair in a dictionary.
EDIT
Here is a slightly optimized way of creating those dataframes. I don't think you can get around creating a list of the columns you want to use for the separate dataframes, so you need to know which column counts occur in your data (unless you want to create those dataframes without naming the columns).
import csv
import pandas as pd

col_list = [['ID', 'NAME', 'AGE'],
            ['ID', 'NAME', 'COUNTRY', 'AGE'],
            ['ID', 'NAME', 'STATE', 'COUNTRY', 'HOBBY']]

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

dict_of_dfs = {}
for cols in col_list:
    dict_of_dfs[f'df_{len(cols)}'] = pd.DataFrame(
        [item for item in data if len(item) == len(cols)], columns=cols)

for key, val in dict_of_dfs.items():
    print(f'{key=}: \n {val} \n')
key='df_3':
ID NAME AGE
0 1 NATA 18
key='df_4':
ID NAME COUNTRY AGE
0 1 OLEG FR 18
key='df_5':
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
Now you don't have variables for your dataframes; instead you have them in a dictionary, accessible by key. (I named each dataframe after the number of columns it has, so df_3 is the dataframe with three columns.)
If you need to import the data with pandas, you could have a look at this post.
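For reference, if you do want to read the ragged file directly with pandas, a rough sketch (assuming at most five space-separated columns; the names c0-c4 are placeholders of mine) could look like this, since rows shorter than the names list are padded with NaN:
import pandas as pd

# Assumption: no row has more than five fields
raw = pd.read_csv('input.csv', sep=' ', header=None,
                  names=['c0', 'c1', 'c2', 'c3', 'c4'])

# The number of non-NaN values per row tells you which group a row belongs to
counts = raw.notna().sum(axis=1)
df_3 = raw[counts == 3].dropna(axis=1, how='all')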
Related
I want to optimize a process of a "vlookup" in Python that works but is not scalable in its current form. I have tried pandas' pivot_table and pivot, but they've been limited due to alphanumeric and string values in the cells. I have two tables:
table1:
ProductID  Sales
123456     34
abc123     34
123def     34
a1234f     34
1abcd6     34
table2:
Brand   Site1   Site2   Site3
Brand1  123456  N/A     N/A
Brand2  N/A     abc123  N/A
Brand1  N/A     N/A     123def
Brand2  N/A     1abcd6  N/A
Brand1  a1234f  N/A     N/A
What I originally wanted to see was sales by brand:
Brand   Sales
Brand1  102
Brand2  68
Here's the pseudocode I've basically built out in Python and Pandas:
import pandas as pd

# read sales and product tables into pandas
sales_df = pd.read_csv(table1)
product_df = pd.read_csv(table2)

# isolate each product id column into separate dfs
product_site1_df = product_df.drop(['Site2', 'Site3'], axis=1)
product_site2_df = product_df.drop(['Site1', 'Site3'], axis=1)
product_site3_df = product_df.drop(['Site1', 'Site2'], axis=1)

# rename and append all product ids into a single column
product_site1_df = product_site1_df.rename(columns={"Site1": "ProductID"})
product_site2_df = product_site2_df.rename(columns={"Site2": "ProductID"})
product_site3_df = product_site3_df.rename(columns={"Site3": "ProductID"})
product_list_master_df = pd.concat([product_site1_df, product_site2_df, product_site3_df])

# compare sales df and product df, pulling brand in as a new column on the sales table
inner_join = pd.merge(sales_df,
                      product_list_master_df,
                      on='ProductID',
                      how='inner')
This is obviously very procedural, not scalable, computationally redundant, and seems like a very round-about way to get to what I want. Additionally, I'm losing data, for example if I want to do a pivot based on sites rather than sales. Short of changing the data model itself, what can I do here to improve speed, versatility, and lines of code?
Assuming the dataframes are named df1 and df2, you can reshape and map to perform the VLOOKUP, then groupby+sum:
(df2.set_index('Brand')
.stack()
.map(df1.set_index('ProductID')['Sales'])
.groupby(level='Brand').sum()
)
Output:
Brand
Brand1 102
Brand2 68
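If you want the result back as a regular two-column DataFrame rather than a Series, a small follow-up on the same chain (reusing the Sales name from df1) would be:
result = (df2.set_index('Brand')
             .stack()
             .map(df1.set_index('ProductID')['Sales'])
             .groupby(level='Brand').sum()
             .rename('Sales')
             .reset_index())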
Here's how you can do it without pandas, just using Python's standard csv lib and a Counter (for your sales-by-brand):
import csv
from collections import Counter

# Create a product/sales lookup
sales_by_product = {}
with open('sales.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header
    for row in reader:
        p_id, sales = row
        sales_by_product[p_id] = int(sales)

sales_by_brand_counter = Counter()
with open('products.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header
    for row in reader:
        brand_id = row[0]
        for p_id in row[1:]:
            sales = sales_by_product.get(p_id, 0)
            sales_by_brand_counter[brand_id] += sales

with open('sales_by_brand.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Brand', 'Sales'])
    rows = [[elem, cnt] for (elem, cnt) in sales_by_brand_counter.items()]
    writer.writerows(rows)
When I run that with sales.csv:
ProductID,Sales
123456,34
abc123,34
123def,34
a1234f,34
1abcd6,34
and products.csv:
Brand,Site1,Site2,Site3
Brand1,123456,N/A,N/A
Brand2,N/A,abc123,N/A
Brand1,N/A,N/A,123def
Brand2,N/A,1abcd6,N/A
Brand1,a1234f,N/A,N/A
I get sales_by_brand.csv:
Brand,Sales
Brand1,102
Brand2,68
The work that really matters, finding product IDs and summing sales, is handled here:
for row in reader:
    brand_id = row[0]
    for p_id in row[1:]:
        sales = sales_by_product.get(p_id, 0)
        sales_by_brand_counter[brand_id] += sales
It can read through as many Site columns as there are. If the site contains 'N/A' or a product ID that isn't in the lookup dict, it just adds 0 to that brand.
I have two CSVs. The first one contains a list of all previous customers with IDs assigned to them, and a new CSV in which I'm auto-generating IDs with the following code:
df['ID'] = pd.to_datetime('today').strftime('%m%d%y') + df.index.map(str)
OLD.csv
ID FirstName LastName
1 John Smith
2 Jack Ma
3 John Wick
.... .... ....
210906ABC3 Jon Snow
210907ABC0 Peter Parker
210907ABC1 Tony Stark
NEW.csv with current script
ID FirstName LastName
210908ABC0 Black Widow
210908ABC1 Steve Rogers
210908ABC2 John Wick
210908ABC3 John Rambo
210908ABC4 Tony Stark
I need to compare the FirstName and LastName columns from the CSVs, and if the customer already exists in OLD.csv, then instead of generating a new ID it should take the ID value from OLD.csv.
Expected output for NEW.csv
ID FirstName LastName
210908ABC1 Black Widow
210908ABC2 Steve Rogers
3 John Wick
210908ABC3 John Rambo
1 John Smith
In the future I might need to compare three or four columns and only assign the IDs if all the columns match, for example FirstName and LastName and (CellPhone or Address) and (Location or SSN).
If you have both files in two dataframes df1 and df2, you can merge the two, then update the ID in the first file and print only the columns from the first file. This will only work for files up to a few thousand rows, as the merge is quite slow.
# key_cols1 / key_cols2 are the lists of key column names in df1 / df2 (see the full example below)
df2.columns = [x + "_2" for x in df2.columns] # to avoid auto renaming by pd
result = pd.merge(df1, df2, how='left', left_on = key_cols1, right_on = key_cols2)
# update the ID column
result.ID = np.where(result.ID_2.isnull(), result.ID, result.ID_2)
print(result.to_csv(index=False, columns=df1.columns))
Edit:
This is a simple working example: file1 (df1) is the file you want to update, and file2 is the file that contains the IDs you want to copy over to file1.
import pandas as pd, numpy as np, argparse, os
parser = argparse.ArgumentParser(description='update id in file1 with id from file2.')
parser.add_argument('-k', help='key column both file', required=True)
parser.add_argument('file1', help='file1 to be updated')
parser.add_argument('file2', help='file2 contains updates for file1')
args = parser.parse_args()
if not os.path.isfile(args.file1): raise ValueError('File does not exist: ' + args.file1)
if not os.path.isfile(args.file2): raise ValueError('File does not exist: ' + args.file2)
df1 = pd.read_csv(args.file1,dtype=str,header=0)
df2 = pd.read_csv(args.file2,dtype=str,header=0)
df2.columns = [x + "_2" for x in df2.columns]
key_col1 = [list(df1.columns)[int(x)] for x in args.k.split(",")]
key_col2 = [list(df2.columns)[int(x)] for x in args.k.split(",")]
result = pd.merge(df1, df2, how='left', left_on = key_col1, right_on = key_col2)
result.ID = np.where(result.ID_2.isnull(), result.ID, result.ID_2)
print(result.to_csv(index=False,columns=df1.columns))
Use it as follows:
$ python merge.py -k 1,2 file1.csv file2.csv
ID,FirstName,LastName
210908ABC0,Black,Widow
210908ABC1,Steve,Rogers
3,John,Wick
210908ABC3,John,Rambo
210907ABC1,Tony,Stark
Make sure that the key is unique per row, otherwise you can get multiple joins generating extra rows in the output file.
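If you would rather keep everything inside a single script without the command-line wrapper, a minimal sketch of the same idea (assuming the files are comma-separated with columns ID, FirstName and LastName, and that the match keys are just the two name columns) could look like this; extend key_cols when you need stricter matching later:
import pandas as pd

new_df = pd.read_csv('NEW.csv', dtype=str)
old_df = pd.read_csv('OLD.csv', dtype=str)

key_cols = ['FirstName', 'LastName']  # add more columns here when needed

# Bring in the old ID for rows whose key columns match an existing customer
merged = new_df.merge(old_df[key_cols + ['ID']].rename(columns={'ID': 'ID_old'}),
                      on=key_cols, how='left')

# Keep the old ID where one was found, otherwise keep the freshly generated one
merged['ID'] = merged['ID_old'].fillna(merged['ID'])
print(merged.drop(columns='ID_old').to_csv(index=False))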
Sample data from text file
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email = michael.toole#123.ie
department=Marketing
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo#sms.ie
department=Data Science
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee#email.com
[User]
Wondering if someone could help me; you can see my sample dataset above. What I would like to do (please tell me if there is a more efficient way) is to loop through the first column, and wherever one of the unique ids occurs (e.g. first_name, last_name, role, etc.) append the value in the corresponding row to that list, and do this with each unique ID so that I'm left with the below.
I have read about multi-indexing and I'm not sure if that might be a better solution, but I couldn't get it to work (I'm quite new to Python).
(expected output was shown as an image in the original post)
# Define a list of selected persons
selectedList = textFile
# Define a list of searching person
searchList = ['uid']
# Define an empty list
foundList = []
# Iterate each element from the selected list
for index, sList in enumerate(textFile):
    # Match the element with the element of searchList
    if sList in searchList:
        # Store the value in foundList if the match is found
        foundList.append(selectedList[index])
You have a text file where each record starts with a [User] line and the data lines have a key=value format. I don't know of a module able to handle that automatically, but it is easy to parse by hand. The code could be:
import pandas as pd

with open('file.txt') as fd:
    data = []                          # a list of records
    for line in fd:
        line = line.strip()            # strip end of line
        if line == '[User]':           # new record
            row = {}                   # row will be a key: value dict
            data.append(row)
        else:
            k, v = line.split('=', 1)  # split on the = character
            row[k] = v

df = pd.DataFrame(data)                # list of key: value dicts => dataframe
With the sample data shown, we get:
employeeNo last_name first_name language email department role email Location
0 123 Toole Michael english michael.toole#123.ie Marketing Marketing Lead NaN NaN
1 456 Ronaldo Juan Spanish NaN Data Science Team Lead juan.ronaldo#sms.ie Spain
2 998 Lee Damian english NaN NaN NaN damian.lee#email.com NaN
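Note that email shows up as two separate columns because the first record has spaces around the = sign, which makes the key 'email ' differ from 'email'. If that bothers you, one small tweak (untested against your full file) is to strip whitespace when storing the pair:
k, v = line.split('=', 1)
row[k.strip()] = v.strip()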
I'm sure there is a more optimal way to do this, but my approach is to get the unique list of row names, extract the rows for each of them in a loop, and combine them into a new dataframe. Finally, update it with the desired column names.
import pandas as pd
import numpy as np
import io
data = '''
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email=michael.toole#123.ie
department=Marketing
role="Marketing Lead"
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo#sms.ie
department="Data Science"
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee#email.com
[User]
'''
df = pd.read_csv(io.StringIO(data), sep='=', comment='[', header=None)
new_cols = df[0].unique()
new_df = pd.DataFrame()
for col in new_cols:
    tmp = df[df[0] == col]
    tmp.reset_index(inplace=True)
    new_df = pd.concat([new_df, tmp[1]], axis=1)
new_df.columns = new_cols
new_df['User'] = None
new_df = new_df[['User','employeeNo','last_name','first_name','language','email','department','role','Location']]
new_df
User employeeNo last_name first_name language email department role Location
0 None 123 Toole Michael english michael.toole#123.ie Marketing Marketing Lead Spain
1 None 456 Ronaldo Juan Spanish juan.ronaldo#sms.ie Data Science Team Lead NaN
2 None 998 Lee Damian english damian.lee#email.com NaN NaN NaN
Rewrite, based on testing of the previous version's offset values:
import pandas as pd
# Revised from previous answer - ensures key value pairs are contained to the same
# record - previous version assumed the first record had all the expected keys -
# inadvertently assigned (Location) value of second record to the first record
# which did not have a Location key
# This version should perform better - only dealing with one single df
# - and using pandas own pivot() function
textFile = 'file.txt'
filter = '[User]'
# Decoration - enabling a check and balance - how many users are we processing?
textFileOpened = open(textFile,'r')
initialRead = textFileOpened.read()
userCount = initialRead.count(filter) # sample has 4 [User] entries - but only three actual unique records
print ('User Count {}'.format(userCount))
# Create sets so able to manipulate and interrogate
allData = []
oneRow = []
userSeq = 0
#Iterate through file - assign record key and [userSeq] Key to each pair
with open(textFile, 'r') as fp:
    for fileLineSeq, line in enumerate(fp):
        if filter in str(line):
            userSeq = userSeq + 1  # Ensures each key value pair is grouped
        else:
            userSeq = userSeq
        oneRow = [fileLineSeq, userSeq, line]
        allData.append(oneRow)
df = pd.DataFrame(allData)
df.columns = ['FileRow','UserSeq','KeyValue'] # rename columns
userSeparators = df[df['KeyValue'] == str(filter+'\n') ].index # Locate [User Records]
df.drop(userSeparators, inplace = True) # Remove [User] records
df = df.replace(' = ' , '=' , regex=True ) # Input data dirty - cleaning up
df = df.replace('\n' , '' , regex=True ) # remove the new lines appended during the list generation
# print(df) # Test as necessary here
# split KeyValue column into two
df[['Key', 'Value']] = df.KeyValue.str.split('=', expand=True)
# very powerful function - convert to table
df = df.pivot(index='UserSeq', columns='Key', values='Value')
print(df)
Results
User Count 4
Key Location department email employeeNo first_name language last_name role
UserSeq
1 NaN Marketing michael.toole#123.ie 123 Michael english Toole Marketing Lead
2 Spain Data Science juan.ronaldo#sms.ie 456 Juan Spanish Ronaldo Team Lead
3 NaN NaN damian.lee#email.com 998 Damian english Lee NaN
I used Python to read a file which contains babies' names, genders and birth-years. Now I want to find out the names which are used by both boys and girls. I used value_counts() to get the number of appearances of each name, but now I don't know how to extract those names from all the names.
Here is my code:
import pandas as pd

def names_both(year):
    names = []
    path = 'babynames/yob%d.txt' % year
    columns = ['name', 'sex', 'birth']
    frame = pd.read_csv(path, names=columns)
    frame = frame['name'].value_counts()
    print(frame)
    """if len(names) != 0:
        print(names)
    else:
        print('None')"""
The frame now is like this:
Lou 2
Willie 2
Erie 2
Cora 2
..
Perry 1
Coy 1
Adolphus 1
Ula 1
Emily 1
Name: name, Length: 1889, dtype: int64
Here is the csv:
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288
Annie,F,1258
Clara,F,1226
Ella,F,1156
Florence,F,1063
...
Thanks for helping!
Here is a way to get the names given to both girls and boys, and to count the births for those names:
common_girl_and_boys_names = (
    # work name by name
    frame.groupby('name')
    # keep only names that were given to both sexes; the boolean ends up in a column called 0
    .apply(lambda x: len(x['sex'].unique()) == 2)
    # the names are now in the index, reset it in order to get the names back as a column
    .reset_index()
    # keep only the names where column 0 is True
    .loc[lambda x: x[0], 'name']
)
final_df = (
    # keep only the names common to boys and girls (the series built above)
    frame.loc[frame['name'].isin(common_girl_and_boys_names), :]
    # sex is now useless
    .drop(['sex'], axis='columns')
    # work name by name and sum the number of births
    .groupby('name')
    .sum()
)
You can put those lines right after the read_csv call (they use the full frame, before the value_counts). I hope this is what you want.
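As a more compact alternative, here is a rough sketch (assuming frame still has the name, sex and birth columns produced by read_csv) that lets groupby and nunique do the filtering:
# Names that appear with more than one sex
shared = frame.groupby('name')['sex'].nunique()
shared_names = shared[shared > 1].index

# Total births for those names, summed over both sexes
final_df = (frame[frame['name'].isin(shared_names)]
            .groupby('name')['birth']
            .sum())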
This may be a simple/duplicate question, but I couldn't find or figure out how to do it yet.
I have two csv files:
info.csv:
"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,
bcd, uvw, 3124, 813-222-1111, tre,
poi, ccc, 9087, 123-45607890, weq,
and then
age.csv:
student_id,age_1
3124,20
9087,21
1234,45
I want to compare the two CSV files based on the "ID" column from info.csv and the "student_id" column from age.csv, take the corresponding "age_1" data, and put it into the "age" column in info.csv.
So the final output should be:
info.csv:
"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,45
bcd, uvw, 3124, 813-222-1111, tre,20
poi, ccc, 9087, 123-45607890, weq,21
I am able to simply join the tables on those keys into a new.csv, but I can't put the data into the column titled "age". I used csvkit to do that.
Here is what I used:
csvjoin -c 3,1 info.csv age.csv > new.csv
You can use pandas and update the info dataframe using the age data. You do it by setting the index of both dataframes to ID and student_id respectively, then updating the age column in the info dataframe. After that you reset the index so ID becomes a column again.
from io import StringIO
import pandas as pd
info = StringIO("""Last Name,First Name,ID,phone,adress,age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,
bcd, uvw, 3124, 813-222-1111, tre,
poi, ccc, 9087, 123-45607890, weq,""")
age = StringIO("""student_id,age_1
3124,20
9087,21
1234,45""")
info_df = pd.read_csv(info, sep=",", engine='python')
age_df = pd.read_csv(age, sep=",", engine='python')
info_df = info_df.set_index('ID')
age_df = age_df.set_index('student_id')
info_df['age X [Total age: 100] |009076'].update(age_df.age_1)
info_df.reset_index(level=0, inplace=True)
info_df
outputs:
ID Last Name First Name phone adress age X [Total age: 100] |009076
0 1234 abc xyz 982-128-0000 pqt 45
1 3124 bcd uvw 813-222-1111 tre 20
2 9087 poi ccc 123-45607890 weq 21
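If you then want to write the updated table back to a file (the output name below is just an assumption; note that ID is now the first column after the reset_index), a minimal follow-up is:
info_df.to_csv('new_info.csv', index=False)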
Try this...
import csv

info = list(csv.reader(open("info.csv", newline='')))
age = list(csv.reader(open("age.csv", newline='')))

def copyCSV(age, info, outFileName='out.csv'):
    # put age into a dict, indexed by ID
    # assumes no duplicate entries
    # 1 - build a dict ageDict to represent the data
    ageDict = dict([(entry[0].replace(' ', ''), entry[1]) for entry in age[1:] if entry != []])
    # 2 - set up the output
    with open(outFileName, 'w', newline='') as outFile:
        outwriter = csv.writer(outFile)
        # 3 - run through info, slot in the ages and write to the output
        # nb: had to use .replace(' ','') to strip out whitespace - it may not be in your original .csv
        outwriter.writerow(info[0])
        for entry in info[1:]:
            if entry != []:
                key = entry[2].replace(' ', '')
                if key in ageDict:  # checks that you have data from age.csv
                    entry[5] = ageDict[key]
                outwriter.writerow(entry)

copyCSV(age, info)
Let me know if it works or if anything is unclear. I've used a dict because it should be faster if your files are massive, as you only have to loop through the data in age.csv once.
There may be a simpler way / something already implemented...but this should do the trick.
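For what it's worth, if pandas is an option here too, a shorter sketch of the same lookup (skipinitialspace and the file names are assumptions about your actual files) might be:
import pandas as pd

info_df = pd.read_csv('info.csv', skipinitialspace=True)
age_df = pd.read_csv('age.csv')

# Map each ID to its age and write it into the last column of info.csv
age_col = info_df.columns[-1]
info_df[age_col] = info_df['ID'].map(age_df.set_index('student_id')['age_1'])
info_df.to_csv('out.csv', index=False)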