I'm trying to read multiple files whose names start with 'site_' (e.g. site_1, site_a).
Each file has data like:
Login_id, Web
1,http://www.x1.com
2,http://www.x1.com,as.php
I need two columns in my pandas df: Login_id and Web.
I am facing an error when I try to read records like row 2.
df_0 = pd.read_csv('site_1',sep='|')
df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.split(',',expand=True)
I am facing the following error:
ValueError: Columns must be same length as key.
Please let me know where I am making a mistake, and suggest a good approach to solve the problem. Thanks
Solution 1: use split with argument n=1 and expand=True.
result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']
That results in a dataframe with two columns, so if your original dataframe has more columns, you need to concat the result with it (that also applies to the next method).
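For instance, here is a minimal, self-contained sketch of that concat step; the extra column 'other_col' is just a hypothetical placeholder:
import pandas as pd

df = pd.DataFrame({'Login_id, Web': ['1,http://www.x1.com', '2,http://www.x1.com,as.php'],
                   'other_col': ['a', 'b']})  # 'other_col' stands in for any extra columns

result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']

# attach the two new columns and drop the original combined one
df = pd.concat([df.drop(columns=['Login_id, Web']), result], axis=1)
print(df)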
EDIT Solution 2: there is a nicer regex-based solution which uses a pandas function:
result = df['Login_id, Web'].str.extract(r'^\s*(?P<Login_id>[^,]*),\s*(?P<URL>.*)', expand=True)
This splits the field and uses the names of the matching groups to create columns with their content. The output is:
Login_id URL
0 1 http://www.x1.com
1 2 http://www.x1.com,as.php
Solution 3: conventional version with a regex:
You could do something customized, e.g. with a regex:
import re
sp_re = re.compile(r'([^,]*),(.*)')
aux_series = df['Login_id, Web'].map(lambda val: sp_re.match(val).groups())
df['Login_id'] = aux_series.str[0]
df['URL'] = aux_series.str[1]
The result on your example data is:
Login_id, Web Login_id URL
0 1,http://www.x1.com 1 http://www.x1.com
1 2,http://www.x1.com,as.php 2 http://www.x1.com,as.php
Now you could drop the column 'Login_id, Web'.
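For example:
df = df.drop(columns=['Login_id, Web'])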
Related
So I am trying to transform the data I have into a form I can work with. I have this column called "season/ teams" that looks something like "1989-90 Bos".
I would like to transform it into a string like "1990" in Python using a pandas dataframe. I read some tutorials about pd.replace() but can't seem to find a use for my scenario. How can I solve this? Thanks for the help.
FYI, I have 16k lines of data.
A snapshot of the data I am working with:
To change that field from "1989-90 BOS" to "1990" you could do the following:
df['Yr/Team'] = df['Yr/Team'].str[:2] + df['Yr/Team'].str[5:7]
If the structure of your data will always be the same, this is an easy way to do it.
If the data in your Yr/Team column has a standard format, you can extract the values you need based on their position.
import pandas as pd
df = pd.DataFrame({'Yr/Team': ['1990-91 team'], 'data': [1]})
df['year'] = df['Yr/Team'].str[0:2] + df['Yr/Team'].str[5:7]
print(df)
Yr/Team data year
0 1990-91 team 1 1991
You can use pd.Series.str.extract to extract a pattern from a column of strings. For example, if you want to extract the first year, second year and team into three different columns, you can use this:
df["season/ teams"].str.extract(r"(?P<start_year>\d+)-(?P<end_year>\d+) (?P<team>\w+)")
Note the use of named groups to automatically name the columns.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
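Applied to the original question (turning '1989-90 Bos' into '1990'), a sketch along these lines might look as follows. The column name 'season/ teams' is taken from the question; note that this simply reuses the start year's century, so a season like '1999-00' would need extra handling:
import pandas as pd

df = pd.DataFrame({'season/ teams': ['1989-90 Bos']})

parts = df['season/ teams'].str.extract(r'(?P<start_year>\d+)-(?P<end_year>\d+) (?P<team>\w+)')
# century of the start year + two-digit end year, e.g. '19' + '90' -> '1990'
df['year'] = parts['start_year'].str[:2] + parts['end_year']
print(df)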
I want to extract several links from a text (comments), which are stored in a pandas dataframe. My goal is to add the extracted URLs in a new column of the original dataset. By applying the following method to my text, I am able to extract the URLs, store them in the variable URL and transform them into another pandas dataframe. I am not sure if this is an efficient way to extract the necessary information.
URL = (ALL.textOriginal.str.extractall("(?P<URL>(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300})))").reset_index('match', drop=True))
URL_df = pd.DataFrame(data=URL)
URL_df.drop([1],axis=1)
gives me:
596 https://www.tag24.de/nachrichten
596 http://www.tt.com/panorama
596 http://www.wz.de/lokales
666 https://www.svz.de/regionales
666 https://www.watson.ch/Leben
... ...
The dataframe contains only the indices and the hyperlinks. The problem with this method is that some of the indices are duplicated, because one comment can contain several URLs, which will all be extracted. I tried different ways to solve this problem, such as:
pd.concat([ALL, URL_df.drop], axis=1).sort_index()
I also tried to store the URLs directly in the original dataframe by applying:
ALL['URL'] = ALL.textOriginal.str.extractall("(?P<URL>(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300})))").reset_index('match', drop=True)
but I only received this error message:
"incompatible index of the inserted column with frame index"
As I said before my goal is to store the extracted URLs in a new column like:
text URL
"blablabla link1, link2, link3" [https://www.tag24.de/nachrichten, http://www.tt.com/panorama, http://www.wz.de/lokales]
"blablabla link1, link2" [https://www.svz.de/regionales, https://www.watson.ch/Leben]
... ...
I think you need findall:
pat = "(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300}))"
ALL['URL'] = ALL.textOriginal.str.findall(pat)
print (ALL)
textOriginal \
0 https://www.tag24.de/nachrichten http://www.tt...
1 https://www.svz.de/regionales https://www.wats...
URL
0 [https://www.tag24.de/nachrichten, http://www....
1 [https://www.svz.de/regionales, https://www.wa... ]
Another solution uses extractall, which returns a MultiIndex, so it is necessary to group by the first level and create lists:
pat = "(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300}))"
ALL['URL'] = ALL.textOriginal.str.extractall(pat).groupby(level=0)[0].apply(list)
print (ALL)
textOriginal \
0 https://www.tag24.de/nachrichten http://www.tt...
1 https://www.svz.de/regionales https://www.wats...
URL
0 [https://www.tag24.de/nachrichten, http://www....
1 [https://www.svz.de/regionales, https://www.wa...
Setup:
ALL = pd.DataFrame({'textOriginal': ['https://www.tag24.de/nachrichten http://www.tt.com/panorama http://www.wz.de/lokales', 'https://www.svz.de/regionales https://www.watson.ch/Leben']})
Let's say you have a dataframe with two columns, 'Indice' and 'Link', where Indice is not unique. You can aggregate all the links with the same Indice in the following way:
myAggregateDF = myDF.groupby('Indice')['Link'].apply(list).to_frame()
In this way, you will obtain a new dataframe indexed by 'Indice' with a 'Link' column, where each 'Link' entry is a list of the previous links.
Pay attention though, this method is not efficient. Groupby is memory hungry and this can be a problem with large dataframes.
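A small sketch of that aggregation, with the list column then joined back onto the comment-level frame (the frames and names here are hypothetical, mirroring the description above):
import pandas as pd

# hypothetical link-level frame: one row per extracted URL, 'Indice' is the comment index
myDF = pd.DataFrame({'Indice': [596, 596, 666],
                     'Link': ['https://www.tag24.de/nachrichten',
                              'http://www.tt.com/panorama',
                              'https://www.svz.de/regionales']})

myAggregateDF = myDF.groupby('Indice')['Link'].apply(list).to_frame()
print(myAggregateDF)

# the list column can then be joined back onto the original comment frame by its index, e.g.:
# ALL = ALL.join(myAggregateDF['Link'].rename('URL'))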
I have a large pandas dataframe on which I am running group-by operations.
CHROM POS Data01 Data02 ......
1 ....................
1 ...................
2 ..................
2 ............
scaf_9 .............
scaf_9 ............
So, I am doing:
my_data_grouped = my_data.groupby('CHROM')
for chr_, data in my_data_grouped:
    do something in chr_
    write something from that chr_ data
Everything is fine with small data and with data where there is no string-type CHROM, i.e. scaf_9. But with very large data that includes scaf_9, I am getting two groups for CHROM 2. There is no error message and it does not affect the computation, but when I write the data out by group, I get two groups for 2 (split unequally).
It is becoming very hard for me to trace back the origin of this problem, since there is no error message and it works well with small data. My only assumptions are:
Is there a certain limit on the number of lines in the total dataframe vs. the grouped dataframe that the pandas module can handle? What is the fix to this problem?
Among all the rows with CHROM 2, could most be treated as integer objects and some (in the later part, close to scaf_9) as string objects? Is this possible?
Sorry, I am only making assumptions here, and it is becoming impossible for me to know the origin of the problem.
Post Edit:
I have also tried running sort_by(['CHROM']) before doing the groupby, but the problem still persists.
Is there any possible fix to the issue?
Thanks,
In my opinion there is a data problem, most likely some whitespace, so pandas processes each group separately.
The solution should be to remove trailing whitespace first:
df.index = df.index.astype(str).str.strip()
You can also check the unique string values of the index:
a = df.index[df.index.map(type) == str].unique().tolist()
If the first column is not the index:
df['CHROM'] = df['CHROM'].astype(str).str.strip()
a = df.loc[df['CHROM'].map(type) == str, 'CHROM'].unique().tolist()
EDIT:
The final solution was simpler: casting to str, like:
df['CHROM'] = df['CHROM'].astype(str)
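To illustrate why the cast helps, here is a small self-contained sketch (not the poster's data): a column holding both the integer 2 and the string '2' produces two separate groups, and casting everything to str merges them again.
import pandas as pd

df = pd.DataFrame({'CHROM': [1, 2, 2, '2', 'scaf_9'],
                   'POS': [10, 20, 30, 40, 50]})

print(df.groupby('CHROM', sort=False).size())   # int 2 and str '2' end up in separate groups

df['CHROM'] = df['CHROM'].astype(str)
print(df.groupby('CHROM', sort=False).size())   # now a single group '2' with 3 rows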
I am working with a data frame that has a structure something like the following:
In[75]: df.head(2)
Out[75]:
statusdata participant_id association latency response \
0 complete CLIENT-TEST-1476362617727 seeya 715 dislike
1 complete CLIENT-TEST-1476362617727 welome 800 like
stimuli elementdata statusmetadata demo$gender demo$question2 \
0 Sample B semi_imp complete male 23
1 Sample C semi_imp complete female 23
I want to be able to run a query string against the column demo$gender.
I.e,
df.query("demo$gender=='male'")
But this has a problem with the $ sign. If I replace the $ sign with another delimiter (like -) then the problem persists. Can I fix up my query string to avoid this problem? I would prefer not to rename the columns, as these correspond tightly with other parts of my application.
I really want to stick with a query string as it is supplied by another component of our tech stack and creating a parser would be a heavy lift for what seems like a simple problem.
Thanks in advance.
With recent versions of pandas, you can escape a column name that contains special characters with backticks (`):
df.query("`demo$gender` == 'male'")
Another possibility is to clean the column names as a preliminary step in your process, replacing special characters with something more appropriate.
For instance:
(df
.rename(columns = lambda value: value.replace('$', '_'))
.query("demo_gender == 'male'")
)
For the interested, here is a simple procedure I used to accomplish the task:
# `query` is the raw query string supplied by the other component, e.g. "demo$gender == 'male'"

# Identify invalid column names
invalid_column_names = [x for x in list(df.columns.values) if not x.isidentifier()]

# Make replacements in the query and keep track
# NOTE: This method fails if the frame has columns called REPL_0 etc.
replacements = dict()
for cn in invalid_column_names:
    r = 'REPL_' + str(invalid_column_names.index(cn))
    query = query.replace(cn, r)
    replacements[cn] = r

inv_replacements = {replacements[k]: k for k in replacements.keys()}

df = df.rename(columns=replacements)        # Rename the columns
df = df.query(query)                        # Carry out query
df = df.rename(columns=inv_replacements)    # Restore the original column names
Which amounts to identifying the invalid column names, transforming the query and renaming the columns. Finally we perform the query and then translate the column names back.
Credit to @chrisb for their answer that pointed me in the right direction.
The current implementation of query requires the string to be a valid python expression, so column names must be valid python identifiers. Your two options are renaming the column, or using a plain boolean filter, like this:
df[df['demo$gender'] =='male']
I am using the ALL.zip file located here. My goal is to create a pandas DataFrame with it. However, if I run
data = pd.read_csv('foo.csv')
the column names do not match up. The first column has no name, and then the second column is labeled with the first, and the last column is a Series of NaN. So I tried
colnames=[list of colnames]
data = pd.read_csv('foo.csv', names=colnames, header=False)
which gave me the exact same thing, so I ran
data = pd.read_csv('foo.csv', names=colnames)
which lined the colnames up perfectly, but put the csv's own column names (the first line in the csv document) as the first row of data. So I ran
data=data[1:]
which did the trick.
So I found a workaround without solving the actual problem. I looked at the read_csv documentation and found it a bit overwhelming, and could not figure out a way to fix this using only pd.read_csv.
What was the fundamental problem (I am assuming it is either user error or a problem with the file)? Is there a way to fix it with one of the options of read_csv?
Here are the first 2 rows from the csv file:
cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
C00458844,"P60006723","Rubio, Marco","HEFFERNAN, MICHAEL","APO","AE","090960009","INFORMATION REQUESTED PER BEST EFFORTS","INFORMATION REQUESTED PER BEST EFFORTS",210,27-JUN-15,"","","","SA17A","1015697","SA17.796904","P2016",
It's not the columns that you're having a problem with, it's the index.
import pandas as pd
df = pd.read_csv('P00000001-ALL.csv', index_col=False, low_memory=False)
print(df.head(1))
cmte_id cand_id cand_nm contbr_nm contbr_city \
0 C00458844 P60006723 Rubio, Marco HEFFERNAN, MICHAEL APO
contbr_st contbr_zip contbr_employer \
0 AE 090960009 INFORMATION REQUESTED PER BEST EFFORTS
contbr_occupation contb_receipt_amt contb_receipt_dt \
0 INFORMATION REQUESTED PER BEST EFFORTS 210 27-JUN-15
receipt_desc memo_cd memo_text form_tp file_num tran_id election_tp
0 NaN NaN NaN SA17A 1015697 SA17.796904 P2016
The low_memory=False is because column 6 has mixed datatypes.
The problem comes from every line in the file except the first ending in a comma (the separator character). That extra trailing field makes each data row one field longer than the header, so pandas treats the first column as the index unless told otherwise.
Try
data = pd.read_csv('P00000001-ALL.csv', index_col=False)
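A small self-contained sketch of that trailing-comma effect (using io.StringIO instead of the real file):
import io
import pandas as pd

csv_text = "a,b,c\n1,2,3,\n4,5,6,\n"   # every data row ends with an extra comma

print(pd.read_csv(io.StringIO(csv_text)))                   # first field is swallowed as the index
print(pd.read_csv(io.StringIO(csv_text), index_col=False))  # columns line up; the empty trailing field is dropped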