I am a newbie to Python but find that I really like how pandas works. I have two identifiers in a column, C and NC, which stand for Core and NonCore. I want to get the mean of the Core rows only. Filtering will remove the NC rows, but it also keeps them out of the final dataframe.
Full Test Code below:
import pandas as pd
import numpy as np
usecols = ['Arcade','Assetnum','Core Or Noncore','Theo Win Per Day Amt']
df = pd.read_csv('test_data.csv',index_col=False, sep=',', thousands=',',na_values=['N/A'])[usecols]
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
df.fillna(0, inplace=True)
df.replace(np.nan, 0, inplace=True)
df['Theo Win Per Day Amt'] = df['Theo Win Per Day Amt'].round(2)
df.insert(4, 'ARC_TWPD_AVG', 0)
new_names = {'Assetnum':'Asset','Theo Win Per Day Amt':'TWPD',}
df.rename(columns=new_names, inplace=True)
arcades = df['Arcade'].drop_duplicates()
for a in arcades:
    ARC_TWPD_AVG = df[df['Arcade'] == a]['TWPD'].mean()
    df.loc[df['Arcade'] == a, 'ARC_TWPD_AVG'] = ARC_TWPD_AVG
with pd.ExcelWriter('test_data.xlsx') as writer:
    df.to_excel(writer, index=False)
I don't know how to attach a JSON file to this.
You just need to filter the dataframe and then take the mean:
df[df['C/NC'] == 'C']['COLUMN'].mean()
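With the column names from your script, that would be something like this (a minimal sketch; I'm assuming the flag column is still called 'Core Or Noncore' after your rename):
core_mean = df[df['Core Or Noncore'] == 'C']['TWPD'].mean()
Or, to get a Core-only average per arcade in one step instead of the loop:
df[df['Core Or Noncore'] == 'C'].groupby('Arcade')['TWPD'].mean()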
I'm trying to read a file that doesn't have any quotes, which causes inconsistent row lengths.
Data looks as follows:
col_a, col_b
abc, inc., 5
xyz corb, 10
Since there are no quotes around "abc, inc.", this is causing the first row to get split into 3 values, but it should actually be just 2 values.
This column is not necessarily in the first position, and there can be another bad column like it. The data has around 250 columns.
I'm reading this using pd.read_csv, how can this be resolved?
Thanks!
It's not a valid CSV, but since only one column has the errant commas, you can process the file with the csv module and collapse the slice that holds too many values: when a row has too many cells, assume the extras come from the unescaped commas.
import pandas as pd
import csv
def split_badrows(fileobj, bad_col, total_cols):
    """Iterate rows, collapsing extra cells at bad_col."""
    for row in csv.reader(fileobj):
        row = [cell.strip() for cell in row]
        extras = len(row) - total_cols
        if extras > 0:
            # collapse the slice at the troubled column into a single value
            extras += 1  # a Python slice doesn't include its right endpoint
            row[bad_col] = ", ".join(row[bad_col:bad_col + extras])
            del row[bad_col + 1:bad_col + extras]
        yield row
def df_from_badtext(fileobj, bad_col):
    """Make a pandas.DataFrame from badly formatted text."""
    columns = [cell.strip() for cell in next(fileobj).split(",")]
    total_cols = len(columns)
    return pd.DataFrame(split_badrows(fileobj, bad_col, total_cols),
                        columns=columns)
# test
open("testme.txt", "w").write("""col_a, col_b
abc, inc., 5
xyz corb, 10""")
df = df_from_badtext(open("testme.txt"), bad_col=0)
print(df)
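This should print something like:
       col_a col_b
0  abc, inc.     5
1   xyz corb    10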
Split the data into a list with a regex, then transform it into a dataframe.
import re
import pandas as pd

text = '''col_a, col_b
abc, inc., 5
xyz corb, 10''' + '\n'

reArr = re.findall(r'(.*),([^,]+)\n', text)
df = pd.DataFrame(reArr[1:], columns=reArr[0])
print(df)
       col_a  col_b
0  abc, inc.      5
1   xyz corb     10
EDIT:
Thanks to tdelaney's comment below, see if this works:
pd.read_csv('foo.csv',delimiter=",(?!( [\w\d]*).,)").dropna(axis=1)
OLD:
Using delimiter=",(?!.*,)" in read_csv seems to solve this for me.
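For reference, a minimal sketch of that older approach (my assumption: the errant commas are confined to the first of two columns, since the lookahead treats only the last comma on each line as the delimiter; regex separators also need the Python parser engine):

import pandas as pd

df = pd.read_csv('foo.csv', sep=r',(?!.*,)', engine='python')
print(df)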
EDIT (after the question was updated with an additional column):
Solution 1:
You can create a function with the bad column as a parameter and use split and concat to correct the dataframe depending on that bad column. Please note that the bad_col parameter in my function is the column number, where we start counting at 1, rather than 0 (e.g. 1, 2, 3, etc. instead of 0, 1, 2, etc.):
import pandas as pd
import numpy as np
from io import StringIO
data = StringIO('''
col, col_a, col_b
000, abc, inc., 5
111, xyz corb, 10
''')
df = pd.read_csv(data, sep="|")
def fix_csv(df, bad_col):
    cols = df.columns.str.split(', ')[0]
    x = len(cols) - bad_col
    tmp = df.iloc[:, 0].str.split(', ', expand=True, n=x)
    df = pd.concat([tmp.iloc[:, 0],
                    tmp.iloc[:, -1].str.rsplit(', ', expand=True, n=x)],
                   axis=1)
    df.columns = cols
    return df

fix_csv(df, bad_col=2)
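With the example data this should return something like:
   col      col_a col_b
0  000  abc, inc.     5
1  111   xyz corb    10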
Solution 2 (if you have issues in multiple columns and need to use more brute force):
From the comments, it sounds like multiple columns could be affected, as you mentioned only one bad column "so far".
As such, this might be a bit of a data cleanup project. The following code gives you an idea of how to do that. The bottom line is that you create two dataframes: 1) the first has the minimum number of commas (i.e. the rows without any issues); 2) the other holds all of the rows with issues. I've shown how you can clean the second one down to the correct number of columns, change the data back, and concat the two dataframes.
import pandas as pd
import numpy as np
from io import StringIO
data = StringIO('''
col, col_a, col_b
000, abc, inc., 5
111, xyz corb, 10
''')
df = pd.read_csv(data, sep="|")
cols = df.columns.str.split(', ')[0]
s = df.iloc[:,0].str.count(',')
df1 = df.copy()[s.eq(s.min())]
df1 = df1.iloc[:,0].str.split(', ', expand=True)
df1.columns = cols
df2 = df.copy()[s.gt(s.min())]
#inspect this dataframe manually to see how many rows affected, which columns, etc.
#cleanup df2 with some .replace so all equal commas
original = [', inc.', ', corp.']
temp = [' inc.', ' corp.']
df2.iloc[:,0] = df2.iloc[:,0].replace(original, temp, regex=True)
df2 = df2.iloc[:,0].str.split(', ', expand=True)
df2.columns = cols
#cleanup df2 by changing back to original values
df2['col_a'] = df2['col_a'].replace(temp, original, regex=True) # you can do this with other columns as well
df3 = pd.concat([df1, df2]).sort_index()
df3
Out[1]:
   col      col_a col_b
0  000  abc, inc.     5
1  111   xyz corb    10
Solution 3: previous solution (for the original question, when the problem was only in the first column; kept for reference)
You can read in with sep="|" as that | character is not in your .csv, so it reads all of the data into one column.
The main assumption of my solution is that the problematic column is only the first column. I use rsplit(', ') and limit the number of splits to the total number of columns minus 1 (with the example data, this is 2-1=1). Hopefully this solves it with your actual data, or at least gives you some idea. If your data is separated by "," rather than ", " (comma plus space), adjust the splits accordingly.
import pandas as pd
import numpy as np
from io import StringIO
data = StringIO('''
col_a, col_b
abc, inc., 5
xyz corb, 10
''')
df = pd.read_csv(data, sep="|")
cols = df.columns.str.split(', ')[0]
x = len(cols) - 1
df = df.iloc[:,0].str.rsplit(', ', expand=True, n=x)
df.columns = cols
df
Out[1]:
       col_a col_b
0  abc, inc.     5
1   xyz corb    10
This is my DataFrame:
d = {'col1': ['sku 1.1', 'sku 1.2', 'sku 1.3'], 'col2': ['9.876.543,21', 654, '321,01']}
df = pd.DataFrame(data=d)
df
      col1          col2
0  sku 1.1  9.876.543,21
1  sku 1.2           654
2  sku 1.3        321,01
Data in col2 are numbers in local format, which I would like to convert into:
col2
9876543.21
654
321.01
I tried df['col2'] = pd.to_numeric(df['col2'], downcast='float'), which returns ValueError: Unable to parse string "9.876.543,21" at position 0.
I tried also df = df.apply(lambda x: x.str.replace(',', '.')), which returns ValueError: could not convert string to float: '5.023.654.46'
Best is to use the thousands and decimal parameters of read_csv, if possible:
df = pd.read_csv(file, thousands='.', decimal=',')
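For example, with your data in a hypothetical semicolon-separated file (semicolons only so the sample values, which themselves contain commas, survive; adjust sep to whatever your file really uses):

from io import StringIO
import pandas as pd

data = StringIO('col1;col2\nsku 1.1;9.876.543,21\nsku 1.2;654\nsku 1.3;321,01')
df = pd.read_csv(data, sep=';', thousands='.', decimal=',')
print(df['col2'])  # 9876543.21, 654.00, 321.01 as float64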
If not possible, then replace should help:
df['col2'] = (df['col2'].replace(r'\.', '', regex=True)
                        .replace(',', '.', regex=True)
                        .astype(float))
You can try swapping the separators in three steps. Note that Series.replace only touches substrings when regex=True, the literal dot must be escaped, and you should target col2 only (other columns may contain dots too):
df['col2'] = df['col2'].replace(',', '&', regex=True)
df['col2'] = df['col2'].replace(r'\.', ',', regex=True)
df['col2'] = df['col2'].replace('&', '.', regex=True)
You are always better off using standard system facilities where they exist. Knowing that some locales use commas and decimal points differently, I could not believe that pandas would not use the formats of the locale.
Sure enough, a quick search revealed this gist, which explains how to use locales to convert strings to numbers. In essence, you need to import locale and, after you've built the dataframe, call locale.setlocale to establish a locale that uses commas as decimal points and periods as thousands separators, then apply the dataframe's applymap method.
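A minimal sketch of that idea (assuming a locale such as 'de_DE.UTF-8' is installed; the exact name varies by OS, e.g. 'German_Germany.1252' on Windows):

import locale
import pandas as pd

locale.setlocale(locale.LC_NUMERIC, 'de_DE.UTF-8')  # comma decimal, dot thousands

df = pd.DataFrame({'col2': ['9.876.543,21', '654', '321,01']})
df['col2'] = df['col2'].map(locale.atof)  # locale-aware string-to-float
print(df)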
I need to polish a CSV dataset, but it seems the changes are not applied to the dataset itself.
CSV is in this format:
ID, TRACK_LINK
761607, https://mylink.com//track/...
This is my script:
import pandas as pd
df = pd.read_csv('./file.csv').fillna('')
# remove double // from TRACK_LINK
def polish_track_link(track_link):
    return track_link.replace("//track", "/track")
df['TRACK_LINK'].apply(polish_track_link)
print(df)
this prints something like:
...
761607 https://mylink.com//track/...
note the //track
If I do print(df['TRACK_LINK'].apply(polish_track_link)) I get:
...
761607 https://mylink.com/track/...
So the function polish_track_link works but it's not applied to the dataset. Any idea why?
You need to assign the result back:
df['TRACK_LINK'] = df['TRACK_LINK'].apply(polish_track_link)
But it is better to use the pandas string method str.replace, or replace with regex=True, to replace substrings:
df['TRACK_LINK'] = df['TRACK_LINK'].str.replace("//track", "/track")
Or:
df['TRACK_LINK'] = df['TRACK_LINK'].replace("//track", "/track", regex=True)
print(df)
ID TRACK_LINK
0 761607 https://mylink.com/track/
How can I copy a DataFrame to_clipboard and paste it in excel with commas as decimal?
In R this is simple.
write.table(obj, 'clipboard', dec = ',')
But I cannot figure out how to do this with pandas' to_clipboard.
I unsuccessfully tried changing the locale:
import locale
locale.setlocale(locale.LC_ALL, '')
which returns 'Spanish_Argentina.1252', and also:
df.to_clipboard(float_format = '%,%')
Since Pandas 0.16 you can use
df.to_clipboard(decimal=',')
to_clipboard() passes extra kwargs to to_csv(), which has other useful options.
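For example (sep and index are ordinary to_csv keywords and pass straight through):

df.to_clipboard(decimal=',', sep=';', index=False)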
There are a few different ways to achieve this. First, it is possible with float_format and your locale, although the use is not so straightforward (but simple once you know it: the float_format argument should be a callable):
df.to_clipboard(float_format='{:n}'.format)
A small illustration:
In [97]: df = pd.DataFrame(np.random.randn(5,2), columns=['A', 'B'])
In [98]: df
Out[98]:
A B
0 1.125438 -1.015477
1 0.900816 1.283971
2 0.874250 1.058217
3 -0.013020 0.758841
4 -0.030534 -0.395631
In [99]: df.to_clipboard(float_format='{:n}'.format)
gives:
A B
0 1,12544 -1,01548
1 0,900816 1,28397
2 0,87425 1,05822
3 -0,0130202 0,758841
4 -0,0305337 -0,395631
If you don't want to rely on the locale setting but still want comma decimal output, you can do this:
class CommaFloatFormatter:
    def __mod__(self, x):
        return str(x).replace('.', ',')

df.to_clipboard(float_format=CommaFloatFormatter())
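This works because pandas applies a string-style float_format with the % operator (float_format % value), which the class intercepts via __mod__. Newer pandas versions also accept a plain callable, e.g. float_format=lambda x: str(x).replace('.', ','), so the helper class may not be needed.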
or simply do the conversion before writing the data to clipboard:
df.applymap(lambda x: str(x).replace('.',',')).to_clipboard()
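Note that applymap turns every cell into a string, so work on a copy (or only on the float columns) if you still need the numeric data afterwards.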