Printing a Python pandas data frame with padding - python

I'm writing a short script to validate data in a CSV. I'm formatting the results to dump to stdout, and for readability I'm adding 5 spaces of padding on the left. Note: I'm NOT using format because I don't want to justify the output.
Code:
def duplicate_data():
    dup_df = inventory_df[inventory_df.duplicated(['STORE_NO', 'SKU'], keep=False)]
    if dup_df.empty:
        print(five, 'INFO: No Duplicate Entries Found')
    else:
        #print('\n')
        print(five, 'WARN: Duplicate STORE_ID and SKU Data Found!')
        print(five, dup_df.to_string(index=False))
Results:
It all works great until it prints the data frame:
WARN: Duplicate STORE_ID and SKU Data Found!
Please Copy/Paste the following and send to the customer:
STORE_NO SKU ON_HAND_QTY
10000001 1000000000007 2
10000002 1000000000007 8
I could iterate over the rows but the formatting is worse than the example above.
for rows in dup_df.iterrows():
    print(five, rows)
Any thoughts as to how I can format the data frame output?

Not super nice, but you could do something like this:
def padlines(text, padding):
    return "\n".join(padding + l for l in text.splitlines())
And then padlines(df.to_string(), five)
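For example, applied to the output from the question (a quick sketch; five is assumed to be the five-space padding string, and the small frame below just stands in for dup_df):
import pandas as pd

five = ' ' * 5  # assumption: `five` is the five-space padding string from the question

def padlines(text, padding):
    return "\n".join(padding + line for line in text.splitlines())

# Stand-in for dup_df, just to show the effect
dup_df = pd.DataFrame({'STORE_NO': [10000001, 10000002],
                       'SKU': [1000000000007, 1000000000007],
                       'ON_HAND_QTY': [2, 8]})

print(padlines('WARN: Duplicate STORE_ID and SKU Data Found!', five))
print(padlines(dup_df.to_string(index=False), five))
Every line of the to_string() output, including the header row, now gets the five-space indent.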

Related

How to divide a pandas data frame into sublists of n at a time?

I have a data frame made of tweets and their authors; there are 45 authors in total. I want to divide the data frame into groups of 2 authors at a time so that I can export them later into CSV files.
I tried the following (given that the authors are in a column named 'B' and the tweets are in a column named 'A'). I took this from this question:
df.set_index(keys=['B'],drop=False,inplace=True)
authors = df['B'].unique().tolist()
in order to separate the lists:
dgroups = []
for i in range(0, len(authors)-1, 2):
    dgroups.append(df.loc[df.B==authors[i]])
    dgroups.extend(df.loc[df.B==authors[i+1]])
but instead it gives me sub-lists like this:
dgroups = [['A'],['B'],
[tweet,author],
['A'],['B'],
[tweet,author2]]
Prior to this I was able to divide them correctly into 45 sub-lists, derived from the previously linked question, as follows:
for i in authors:
    groups.append(df.loc[df.B==i])
So how would I do that for 2 authors, or 3 authors, and so on?
EDIT: From @Jonathan Leon's answer, I thought I would do the following, which worked, but it isn't a dynamic solution and I guess is inefficient, especially if n > 3:
dgroups = []
for i in range(2, len(authors)+1, 2):
    tempset1 = []
    tempset2 = []
    tempset1 = df.loc[df.B==authors[i-2]]
    if (i-1 != len(authors)):
        tempset2 = df.loc[df.B==authors[i-1]]
        dgroups.append(tempset1.append(tempset2))
    else:
        dgroups.append(tempset1)
This imports the foreign-language text incorrectly, but the logic works to create a new CSV for every two authors.
import pandas as pd

df = pd.read_csv('TrainDataAuthorAttribution.csv')
# df.groupby('B').count()
authors = df.B.unique().tolist()
auths_in_subset = 2
for i in range(auths_in_subset, len(authors)+auths_in_subset, auths_in_subset):
    # print(authors[i-auths_in_subset:i])
    dft = df[df.B.isin(authors[i-auths_in_subset:i])]
    # print(dft)
    dft.to_csv('df' + str(i) + '.csv')
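An equivalent way to get the same grouping without slicing the author list is to map every author to a chunk number and group on that. This is only a sketch under the same assumptions as above (column 'B' holds the author; the file name is taken from the answer):
import pandas as pd

df = pd.read_csv('TrainDataAuthorAttribution.csv')
auths_in_subset = 2

# Map each author to a chunk index: first two authors -> 0, next two -> 1, ...
authors = df.B.unique().tolist()
chunk_of = {author: i // auths_in_subset for i, author in enumerate(authors)}

# One groupby produces every CSV, whatever the value of auths_in_subset
for chunk_id, dft in df.groupby(df.B.map(chunk_of)):
    dft.to_csv('df' + str(chunk_id) + '.csv')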

Is pandas and numpy any good for manipulation of non numeric data?

I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it feels like going down a rabbit hole trying to extract that data and then manipulate it with the numpy toolsets. The delivered data is one-dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray magically makes it all good, except that I lose the headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I'd call a more non-mathematical / non-numerical purpose makes it feel like I'm barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, and other textual data. I then have another CSV file that I need to use as a lookup, so that I can enrich the source with more textual information, which finally gets published to Excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate them with the (matching) lookup data.
    5) Export the now-enhanced data to Excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers
    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # The CSV file has more rows than there are WorkCodes, because there are different versions (as different records).
    # Only want the last one.
    # Each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        #                                    # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.
    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv(filepath, index=False)
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully, for manipulating non-numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas data frame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks to juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, how they can be altered/manipulated, and how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice, elegant functions.
Cheers
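For reference, a minimal sketch of the kind of pandas-only pipeline the answer alludes to. The column names come from the sample data above; the arcpy call, lookup.csv path and the merge keys are assumptions based on the code in the question, not the author's final functions:
import pandas
import arcpy  # available inside the ArcGIS Python environment

# 1) Source data: keep it as a DataFrame so the field names survive
exportFile = arcpy.da.FeatureClassToNumPyArray(
    "ApprovedActivityByLocalBoard",
    ['WorkCode', 'Status', 'WorkName', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
worksites = pandas.DataFrame(exportFile)

# 2) Lookup data: keep only the latest Version per Id
permits = pandas.read_csv("lookup.csv",
                          usecols=["Id", "Version", "Utility/Principal",
                                   "Principal Contractor Contact"])
permits = permits.sort_values("Version").drop_duplicates("Id", keep="last")

# 3) Enrich: one vectorised merge instead of a row-by-row lookup loop
report = worksites.merge(permits, left_on="WorkCode", right_on="Id", how="left")

# 4) Export: unmatched rows simply keep empty lookup columns
report.to_excel("finalReport.xlsx", index=False)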

How to import data from a .txt file into arrays in python

I am trying to import data from a .txt file that contains four tab-separated columns and is several thousand lines long. This is what the start of the document looks like:
Data info
File name: D:\(path to file)
Start time: 6/26/2019 15:39:54.222
Number of channels: 3
Sample rate: 1E6
Store type: fast on trigger
Post time: 20
Global header information: from DEWESoft
Comments:
Events
Event Type Event Time Comment
1 storing started at 7.237599
2 storing stopped at 7.257599
Data1
Time Incidente Transmitida DI 6
s um/m um/m -
0 2.1690152 140.98599 1
1E-6 2.1690152 140.98599 1
2E-6 4.3380303 145.32402 1
3E-6 4.3380303 145.32402 1
4E-6 -2.1690152 145.32402 1
I have several of these files that I want to loop through and store in a cell/list so that each cell/list item contains the four columns. After that I just use that cell/list to plot the data in a loop.
I saw that the pandas library is suitable for this, but I don't understand how to use it.
fileNames = (["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
"Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
"Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"])
folderName = 'AuxeticsSHPB\\' #Source folder for all files above
# Loop trough each source document
for i in range(0,len(fileNames)):
print('File location: '+folderName+fileNames[i])
# Get data from source as arrays, cut out the first 20 lines
temp=pd.read_csv(folderName+fileNames[i], sep='\t', lineterminator='\r',
skiprows=[19], error_bad_lines=False)
# Store data in list/cell
# data[i] = temp # sort it
This is something I tried that didn't work; I don't really know how to proceed. I know there is some documentation on this problem, but I am new to this and need some help.
An error I get when trying the above:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 4
So it was an easy fix: I just had to remove the brackets from skiprows=[19], so that the first 19 rows are skipped rather than only row 19.
The code now looks like this and works:
fileNames = ["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
"Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
"Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"]
folderName = 'AuxeticsSHPB\\' #Source folder for all files above
# Preallocation
data = []
for i in range(0,len(fileNames)):
temp=pd.read_csv(folderName+fileNames[i], sep='\t', lineterminator='\r',
skiprows=19)
data.append(temp)
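Once the frames are stored, the plotting loop the question mentions could look roughly like this (a sketch only; the column positions, the leftover units row under the header, and the axis labels are assumptions based on the sample file and may need adjusting):
import matplotlib.pyplot as plt
import pandas as pd

for name, frame in zip(fileNames, data):
    # Coerce to numeric in case the units row ('s  um/m  um/m  -') ends up under the header
    frame = frame.apply(pd.to_numeric, errors='coerce').dropna()
    # first column = time, second column = first measurement channel
    plt.plot(frame.iloc[:, 0], frame.iloc[:, 1], label=name)
plt.xlabel('Time [s]')
plt.legend()
plt.show()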

Formatting the data correctly from text file using pandas Python

I have data in my .txt file
productname1
7,64
productname2
6,56
4.73
productname3
productname4
12.58
10.33
To explain the data: the product name is on the first line and the price is on the second line. But for the 2nd product name we have both the original price and the discounted price. Also, the prices sometimes use '.' and sometimes ',' to represent cents. I want to format the data in the following way:
Product o_price d_price
productname1 7.64 -
productname2 6.56 4.73
productname3 - -
productname4 12.58 10.33
My current approach is a bit naive but it works for 98% of the cases
import pandas as pd

data = {}
tempKey = []
with open("myfile.txt", encoding="utf-8") as file:
    arr_content = file.readlines()
    for val in arr_content:
        if not val[0].isdigit():  # check whether the starting letter is a digit or text
            val = ' '.join(val.split())  # Remove extra spaces
            data.update({val: []})  # Add the key to the dict, initialized with a list that will hold the prices
            tempKey.append(val)  # keep track of the last key added, because dicts are not sequential
        else:
            data[str(tempKey[-1])].append(val)  # Use the last added key and append the price to its list
df = pd.DataFrame(list(data.items()), columns=['Product', 'Pricelist'])
df[['o_price', 'd_price']] = pd.DataFrame([x for x in df.Pricelist])
df = df.drop('Pricelist', axis=1)
So this technique does not work when a product name starts with a digit. Any suggestions for a better approach?
Use a regular expression to check whether the line contains only digits and the '.' / ',' separators used in the prices:
import re

if re.match(r"^[0-9.,]+$", val):
    # This is a product price
    ...
else:
    # This is a product name
    ...
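A minimal sketch of how that check could slot into the loop from the question (assuming the same myfile.txt layout; commas in the prices are normalised to periods so the columns can be treated as numbers later):
import re
import pandas as pd

data = {}
tempKey = []
with open("myfile.txt", encoding="utf-8") as file:
    for val in file:
        val = ' '.join(val.split())  # collapse extra whitespace and strip the newline
        if not val:
            continue  # skip blank lines
        if re.match(r"^[0-9.,]+$", val):  # the line is a price
            data[tempKey[-1]].append(val.replace(',', '.'))
        else:  # the line is a product name
            data[val] = []
            tempKey.append(val)

df = pd.DataFrame(list(data.items()), columns=['Product', 'Pricelist'])
df[['o_price', 'd_price']] = pd.DataFrame([x for x in df.Pricelist])
df = df.drop('Pricelist', axis=1)
print(df)
Missing prices come out as None/NaN rather than '-'; df.fillna('-') can be used if the dash is needed for display.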

Pandas GroupBy Mean of Large DataSet in CSV

A common SQLism is "Select A, mean(X) from table group by A" and I would like to replicate this in pandas. Suppose that the data is stored in something like a CSV file and is too big to be loaded into memory.
If the CSV could fit in memory a simple two-liner would suffice:
data=pandas.read_csv("report.csv")
mean=data.groupby(data.A).mean()
When the CSV cannot be read into memory one might try:
chunks = pandas.read_csv("report.csv", chunksize=whatever)
cmeans = pandas.concat([chunk.groupby(chunk.A).mean() for chunk in chunks])
badMeans = cmeans.groupby(level=0).mean()
Except that the resulting cmeans table contains repeated entries for each distinct value of A, one for each appearance of that value of A in distinct chunks (since read_csv's chunksize knows nothing about the grouping fields). As a result the final badMeans table has the wrong answer: it averages the per-chunk means without weighting them by the per-chunk counts, so it would need to compute a weighted mean instead.
So a working approach seems to be something like:
final = pandas.DataFrame({"A": [], "tot": [], "cnt": []})
for chunk in chunks:
    t = chunk.groupby(chunk.A).X.sum()
    c = chunk.groupby(chunk.A).X.count()
    cmean = pandas.DataFrame({"tot": t, "cnt": c}).reset_index()
    joined = pandas.concat([final, cmean])
    final = joined.groupby(joined.A).sum().reset_index()
mean = final.tot / final.cnt
Am I missing something? This seems insanely complicated... I would rather write a for loop that processes a CSV line by line than deal with this. There has to be a better way.
I think you could do something like the following which seems a bit simpler to me. I made the following data:
id,val
A,2
A,5
B,4
A,2
C,9
A,7
B,6
B,1
B,2
C,4
C,4
A,6
A,9
A,10
A,11
C,12
A,4
A,4
B,6
B,5
C,7
C,8
B,9
B,10
B,11
A,20
I'll do chunks of 5:
chunks = pd.read_csv("foo.csv",chunksize=5)
pieces = [x.groupby('id')['val'].agg(['sum','count']) for x in chunks]
agg = pd.concat(pieces).groupby(level=0).sum()
print(agg['sum'] / agg['count'])
id
A 7.272727
B 6.000000
C 7.333333
Compared to the non-chunk version:
df = pd.read_csv('foo.csv')
print(df.groupby('id')['val'].mean())
id
A 7.272727
B 6.000000
C 7.333333
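Wrapped up as a small helper, the same sum/count trick looks roughly like this (just a sketch; the function name and the chunksize default are made up, and the column names are passed in as parameters):
import pandas as pd

def chunked_group_mean(path, key, val, chunksize=100_000):
    """Mean of `val` grouped by `key`, computed one CSV chunk at a time."""
    pieces = [
        chunk.groupby(key)[val].agg(['sum', 'count'])
        for chunk in pd.read_csv(path, chunksize=chunksize)
    ]
    totals = pd.concat(pieces).groupby(level=0).sum()
    return totals['sum'] / totals['count']

print(chunked_group_mean('foo.csv', 'id', 'val', chunksize=5))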
